Consider a two-class clustering problem where we observe $X_i = \ell_i \mu + Z_i$,
$Z_i \overset{iid}{\sim} N(0, I_p)$, $1 \le i \le n$. The feature vector $\mu \in \mathbb{R}^p$ is
unknown but is presumably sparse. The class labels $\ell_i \in \{-1, 1\}$ are
also unknown and the main interest is to estimate them.

We are interested in the statistical limits. In the two-dimensional 
phase space calibrating the rarity and strengths of useful features, we 
find the precise demarcation for the Region of Impossibility and Region
of Possibility. In the former, useful features are too rare/weak for
successful clustering. In the latter, useful features are strong enough 
to allow successful clustering. The results are extended to the case of 
colored noise using Le Cam’s idea on comparison of experiments. 

We also extend the study on statistical limits for clustering to that 
for signal recovery and that for global testing. We compare the statistical
limits for the three problems and expose some interesting insights.

We propose classical PCA and Important Features PCA (IF-PCA) 
for clustering. For a threshold t > 0, IF-PCA clusters by applying 
classical PCA to all columns of $X$ with an $\ell^2$-norm larger than $t$. We
also propose two aggregation methods. For any parameter in the Region
of Possibility, some of these methods yield successful clustering.

We discover a phase transition for IF-PCA. For any threshold
$t > 0$, let $\hat\xi^{(t)}$ be the first left singular vector of the post-selection
data matrix. The phase space partitions into two different regions. In
one region, there is a $t$ such that $\cos(\hat\xi^{(t)}, \ell) \to 1$ and IF-PCA yields
successful clustering. In the other, $\cos(\hat\xi^{(t)}, \ell) \le c_0 < 1$ for all $t > 0$.

Our results require delicate analysis, especially on post-selection 
Random Matrix Theory and on lower bound arguments. 


1. Introduction. Motivated by the interest on gene microarray study, 
we consider a clustering problem where we have n subjects from two different 
classes (e.g., normal and diseased), measured on the same set of p features 
(i.e., gene expression level). To facilitate the analysis, we assume that two 
classes are equally likely so the class labels satisfy 

(1.1)  $\ell_i \overset{iid}{\sim} 2\,\mathrm{Bernoulli}(1/2) - 1, \qquad 1 \le i \le n.$

We also assume that the $p$-dimensional data vectors $X_i$'s are standardized,
so that for a contrast mean vector $\mu \in \mathbb{R}^p$,

(1.2)  $X_i = \ell_i \mu + Z_i, \qquad Z_i \overset{iid}{\sim} N(0, I_p), \qquad 1 \le i \le n.$

Throughout this paper, we call feature $j$, $1 \le j \le p$, a “useless feature” or
“noise” if $\mu(j) = 0$ and a “useful feature” or “signal” otherwise.

The paper focuses on the problem of clustering (i.e., estimating the class
labels $\ell_i$). Such a problem is of interest, especially in the study of complex
disease [35]. In the two-dimensional phase space calibrating the signal rarity
and signal strengths, we are interested in the following limits.$^1$

• Statistical limits. This is the precise boundary that separates the Region
of Impossibility and Region of Possibility. In the former, the signals
are so rare and (individually) weak that it is impossible for any
method to correctly identify most of the class labels. In the latter,
the signals are strong enough to allow successful clustering, and it is
desirable to develop methods that cluster successfully.

• Computationally tractable statistical limits. This is similar to the boundary
above, except that for both Possibility and Impossibility, we only
consider statistical methods that are computationally tractable. 

We use Region of Possibility and Region of Impossibility as generic terms, 
which may vary from occurrence to occurrence. 

The paper also contains three closely related objectives as follows, which 
we discuss in Sections 1.4 and 2, Section 3, and Section 4, respectively. 

• Performance of the recent idea of Important Features PCA (IF-PCA). 

• Limits for recovering the support of $\mu$ (signal recovery).

• Limits for testing whether $X_i$'s are iid samples from $N(0, I_p)$, or
generated from Model (1.2) (hypothesis testing).

Our work on sparse clustering is related to Azizyan et al [7] and Chan 
and Hall [13] (see also [38, 39, 43, 49]): the three papers share the same 
spirit that we should do a feature selection before we cluster. Our work on 
support recovery is related to recent interest on sparse PCA (e.g., Amini 
and Wainwright [3], Johnstone and Lu [32], Vu and Lei [46], Wang et al 
[47], Arias-Castro and Verzelen [5]), and our work on hypothesis testing
is related to recent interest on matrix estimation and matrix testing (e.g.,
Arias-Castro and Verzelen [5], Cai et al [11]). However, our work is different
in many important aspects, especially for our focus on the limits and on the 
Rare/Weak models. See Section 6 for more discussion. 

1 All limits in this paper are with respect to the ARW model introduced in Section 1.2.

1.1. Four clustering methods. Denoting the data matrix by $X$, we write

$X = [X_1, X_2, \ldots, X_n]', \qquad X = [x_1, x_2, \ldots, x_p].$

We introduce two methods: a feature aggregation method and IF-PCA. Each 
method includes a special case, which can be viewed as a different method. 

The first method $\hat\ell_N^{(sa)}$ targets the case where the signals are rare but
individually strong (“sa”: Sparse Aggregation; $N$: tuning parameter; usually,
$N \ll p$), so feature selection is desirable. Denote the support of $\mu$ by

(1.3)  $S(\mu) = \{1 \le j \le p : \mu(j) \ne 0\}.$

The procedure first estimates $S(\mu)$ by optimizing ($\|\cdot\|_1$: vector $L^1$-norm)

(1.4)  $\hat{S}_N^{(sa)} = \mathrm{argmax}_{\{S \subset \{1, 2, \ldots, p\}:\ |S| = N\}} \Big\{ \Big\| \sum_{j \in S} x_j \Big\|_1 \Big\},$

and then clusters by aggregating all selected features: $\hat\ell_N^{(sa)} = \mathrm{sgn}\big(\sum_{j \in \hat{S}_N^{(sa)}} x_j\big)$.$^2$


An important special case is $N = p$, where $\hat\ell_N^{(sa)}$ reduces to the method
of Simple Aggregation which we denote by $\hat\ell_*^{(sa)}$.$^3$ This procedure targets
the case where the signals are weak but less sparse, so feature selection is
hopeless. Note that $\hat\ell_N^{(sa)}$ is generally NP-hard but $\hat\ell_*^{(sa)}$ is not.
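
As an illustration of why Simple Aggregation is computationally trivial, the following is a minimal sketch in Python, assuming NumPy is available. The function names, the toy sizes, and the signal strength 0.2 are hypothetical choices made for this example and are not taken from the paper. The estimate is just the sign of the row sums of the data matrix.

```python
import numpy as np


def simple_aggregation(X):
    """Estimate labels as the sign of the sum of all p columns of X (n-by-p)."""
    # Summing the columns x_1, ..., x_p gives the vector of row sums.
    return np.sign(X.sum(axis=1))


def clustering_error(est, truth):
    """Fraction of mislabeled subjects, minimized over a global sign flip."""
    return min(np.mean(est != truth), np.mean(est != -truth))


# Toy data (illustrative sizes only): dense but individually weak signals.
rng = np.random.default_rng(0)
n, p = 50, 2000
labels = rng.choice([-1, 1], size=n)
mu = np.full(p, 0.2)  # hypothetical contrast mean vector
X = np.outer(labels, mu) + rng.standard_normal((n, p))

error = clustering_error(simple_aggregation(X), labels)
print(error)
```

Sparse Aggregation would additionally require a search over subsets of size $N$, which is the source of its NP-hardness and is not attempted here.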

The second method is IF-PCA, denoted by $\hat\ell_q^{(if)}$, where $q > 0$ is a tuning
parameter. The method targets the case where the signals are rare but
individually strong. To use $\hat\ell_q^{(if)}$, we first select features using the $\chi^2$-tests:

(1.5)  $\hat{S}_q^{(if)} = \{1 \le j \le p : Q(j) > \sqrt{2q\log(p)}\}, \qquad Q(j) = (\|x_j\|^2 - n)/\sqrt{2n}.$

We then obtain the first left singular vector of the post-selection data
matrix $X^{(q)}$ (containing only columns of $X$ where the indices are in $\hat{S}_q^{(if)}$):

(1.6)  $\hat\xi^{(q)} = \xi(X^{(q)}),$

and cluster by $\hat\ell_q^{(if)} = \mathrm{sgn}(\hat\xi^{(q)})$. IF-PCA includes the classical PCA (denoted
by $\hat\ell_*^{(if)}$) as a special case, where the feature selection step is skipped, and
$\hat\xi^{(q)}$ reduces to the first left singular vector of $X$.$^4$
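
The two steps of IF-PCA (screening, then PCA on the retained columns) can be sketched as follows. This is a minimal illustration in Python assuming NumPy, not the authors' implementation; the function name `if_pca`, the toy dimensions, and the signal strength are assumptions made for the example.

```python
import numpy as np


def if_pca(X, q):
    """IF-PCA sketch: chi-square screening as in (1.5), then the sign of the
    leading left singular vector of the post-selection matrix as in (1.6)."""
    n, p = X.shape
    Q = (np.sum(X ** 2, axis=0) - n) / np.sqrt(2 * n)
    selected = np.flatnonzero(Q > np.sqrt(2 * q * np.log(p)))
    if selected.size == 0:
        raise ValueError("no feature passes the threshold; try a smaller q")
    # Leading left singular vector of the n-by-|selected| submatrix.
    U, _, _ = np.linalg.svd(X[:, selected], full_matrices=False)
    return np.sign(U[:, 0]), selected


# Toy data (hypothetical values): 30 strong signals among 1000 features.
rng = np.random.default_rng(1)
n, p, n_signals, tau = 40, 1000, 30, 1.5
labels = rng.choice([-1, 1], size=n)
mu = np.zeros(p)
mu[:n_signals] = tau
X = np.outer(labels, mu) + rng.standard_normal((n, p))

est, selected = if_pca(X, q=3.0)
error = min(np.mean(est != labels), np.mean(est != -labels))
print(selected.size, error)
```

Since a singular vector is only determined up to sign, the error is evaluated up to a global flip of the labels, in line with the Hamming distance used later in the paper.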

In Table 1, we compare all four methods. Note that for more complicated
cases (e.g., the nonzero $\mu(j)$'s may be both positive and negative), we may
consider a variant of $\hat\ell_N^{(sa)}$ which clusters by $\hat\ell = \mathrm{sgn}(X\hat\mu)$, with $\hat\mu$ being
$\mathrm{argmax}_{\{\mu(j) \in \{-1, 0, 1\},\ \|\mu\|_0 = N\}} \|X\mu\|_q$, where $q > 0$. If we let $q = 1$ and restrict
$\mu(j) \in \{0, 1\}$, it reduces to the current $\hat\ell_N^{(sa)}$. Note that when $N = p$ and
$q = 2$, approximately, $\hat\mu$ is proportional to the first right singular vector of
$X$ and $\hat\ell_*^{(sa)}$ is approximately the classical PCA. Note also that $\hat\ell_q^{(if)}$ can be
viewed as the adaption of IF-PCA in Jin and Wang [31] to Model (1.2).
The version in [31] is a tuning free algorithm for analyzing microarray data
and is much more sophisticated. The current version of IF-PCA is similar
to that in Johnstone and Lu [32] but is also different in purpose and in
implementation: the former is for estimating $\ell$ and uses the first left singular
vector of the post-selection data matrix, and the latter is for estimating $\mu$
and uses the first right singular vector. The theories the two methods entail are
also very different. See Sections 1.8 and 6 for more discussion.

2 For any vector $x \in \mathbb{R}^n$, $\mathrm{sgn}(x) \in \mathbb{R}^n$ is the vector where the $i$-th entry is $\mathrm{sgn}(x_i)$,
$1 \le i \le n$ ($\mathrm{sgn}(x_i) = -1, 0, 1$ according to $x_i < 0$, $= 0$, or $> 0$).

3 The superscript “sa” now loses its original meaning, but we keep it for consistency.

4 The superscript “if” now loses its original meaning, but we keep it for consistency.


Table 1

Comparison of basic characteristics of four methods. *: signals are comparably stronger
but still weak. †: a tuning-free version exists.

Methods            Simple Aggregation    Sparse Aggregation                Classical PCA            IF-PCA
Notation           $\hat\ell_*^{(sa)}$   $\hat\ell_N^{(sa)}$ ($N \ll p$)   $\hat\ell_*^{(if)}$      $\hat\ell_q^{(if)}$ ($q > 0$)
Signals            less sparse/weak      sparse/strong*                    moderately sparse/weak   very sparse/strong
Feature selection  No                    Yes                               No                       Yes
Comp. complexity   Polynomial            NP-hard                           Polynomial               Polynomial
Need tuning        No                    Yes                               No                       Yes†


1.2. Rare and Weak signal model. To study all these limits, we invoke
the Asymptotic Rare and Weak (ARW) model [12, 19, 20, 26]. In ARW, for
two parameters $(\epsilon, \tau)$, we model the contrast mean vector $\mu$ by

(1.7)  $\mu(j) \overset{iid}{\sim} (1 - \epsilon)\nu_0 + \epsilon\nu_\tau, \qquad 1 \le j \le p,$

where $\nu_a$ denotes the point mass at $a$. In Model (1.7), all signals have the
same sign and magnitude. Such an assumption can be largely relaxed; see
Sections 1.6 and 6. We use $p$ as the driving asymptotic parameter and tie
$(n, \epsilon, \tau)$ to $p$ by fixed parameters. In detail, fixing $(\theta, \beta) \in (0, 1)^2$ and $\alpha > 0$,
we model

(1.8)  $n = n_p = p^{\theta}, \qquad \epsilon = \epsilon_p = p^{-\beta}, \qquad \tau = \tau_p = p^{-\alpha}.$

In our model, $n \ll p$ for we focus on the modern “large $n$, really large $p$”
regime [42]. The study can be conveniently extended to the case of $n \gg p$.
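
For readers who wish to experiment with the ARW model, the following is a minimal simulation sketch in Python (assuming NumPy). The function name `simulate_arw`, the rounding of $p^\theta$ to an integer sample size, and the parameter values in the usage lines are assumptions made for illustration only.

```python
import numpy as np


def simulate_arw(p, theta, beta, alpha, rng):
    """Draw (X, labels, mu) from Models (1.1)-(1.2) under the ARW
    calibration (1.7)-(1.8)."""
    n = int(round(p ** theta))       # n = p^theta, rounded (an assumption)
    eps = p ** (-beta)               # expected fraction of useful features
    tau = p ** (-alpha)              # common signal strength
    labels = 2 * rng.binomial(1, 0.5, size=n) - 1        # (1.1)
    mu = tau * rng.binomial(1, eps, size=p)               # (1.7)
    X = np.outer(labels, mu) + rng.standard_normal((n, p))  # (1.2)
    return X, labels, mu


# Hypothetical parameter values, chosen only to keep the example small.
X, labels, mu = simulate_arw(p=2500, theta=0.5, beta=0.3, alpha=0.1,
                             rng=np.random.default_rng(0))
print(X.shape, int(np.count_nonzero(mu)))
```

The number of nonzero entries of `mu` is random with mean $p\epsilon_p = p^{1-\beta}$, which is the quantity denoted by $s$ in the figures.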
1.3. Limits for clustering. Let $\Pi$ be the set of all possible permutations
on $\{-1, 1\}$. For any clustering procedure $\hat\ell$ (where $\hat\ell_i$ takes values from
$\{-1, 1\}$), we measure the performance by the Hamming distance:

(1.9)  $\mathrm{Hamm}_p(\hat\ell, \alpha, \beta, \theta) = n^{-1} \inf_{\pi \in \Pi} \Big\{ \sum_{i=1}^n P(\hat\ell_i \ne \pi \ell_i) \Big\},$

where the probability is evaluated with respect to $(\mu, \ell, Z)$. Fixing $\theta \in (0, 1)$,
introduce a curve $\alpha = \eta_\theta^{clu}(\beta)$ in the $\beta$-$\alpha$ plane by

$\eta_\theta^{clu}(\beta) = \begin{cases} (1 - 2\beta)/2, & \beta < (1 - \theta)/2, \\ \theta/2, & (1 - \theta)/2 < \beta < (1 - \theta), \\ (1 - \beta)/2, & \beta > (1 - \theta). \end{cases}$
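
The two objects just defined are simple to compute. The sketch below, in plain Python, gives an empirical analogue of the Hamming distance (1.9) for a single realization (the paper's definition involves probabilities, so this is only a sample version) together with the piecewise curve $\eta_\theta^{clu}$. The function names and the handling of the boundary points, where the displayed definition uses strict inequalities, are assumptions of this sketch.

```python
def clustering_error(est, truth):
    """Empirical analogue of (1.9): fraction of mismatches, minimized over
    the two permutations of {-1, 1} (identity and sign flip)."""
    n = len(truth)
    direct = sum(e != t for e, t in zip(est, truth))
    flipped = sum(e != -t for e, t in zip(est, truth))
    return min(direct, flipped) / n


def eta_clu(beta, theta):
    """Statistical boundary for clustering in the beta-alpha plane."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 1 - theta:
        return theta / 2
    return (1 - beta) / 2


print(clustering_error([1, 1, 1, -1], [1, 1, -1, -1]))
print(eta_clu(0.5, 0.4))
```

The three pieces agree at the break points $\beta = (1-\theta)/2$ and $\beta = 1-\theta$, so the curve is continuous and the choice made at the boundaries does not affect its values.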


Theorem 1.1 (Statistical lower bound).$^5$ Fix $(\theta, \beta) \in (0, 1)^2$ and $\alpha > 0$
such that $\alpha > \eta_\theta^{clu}(\beta)$. Consider the clustering problem for Models (1.1)-(1.2)
and (1.7)-(1.8). For any procedure $\hat\ell$, $\liminf_{p \to \infty} \mathrm{Hamm}_p(\hat\ell, \alpha, \beta, \theta) \ge 1/2$.

Theorem 1.2 (Statistical upper bound for clustering). Fix $(\theta, \beta) \in (0, 1)^2$
and $\alpha > 0$ such that $\alpha < \eta_\theta^{clu}(\beta)$, and consider the clustering problem for
Models (1.1)-(1.2) and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Hamm}_p(\hat\ell_*^{(sa)}, \alpha, \beta, \theta) \to 0$, if $0 < \beta < (1 - \theta)/2$.

• $\mathrm{Hamm}_p(\hat\ell_N^{(sa)}, \alpha, \beta, \theta) \to 0$, if $(1 - \theta)/2 < \beta < 1$ and $N = \lceil p\epsilon_p \rceil$.$^6$

As a result, the curve $\alpha = \eta_\theta^{clu}(\beta)$ divides the $\beta$-$\alpha$ plane into two regions:
Region of Impossibility and Region of Possibility. In the former, the signals 
are so weak that successful clustering is impossible. In the latter, the signals 
are strong enough to allow successful clustering. 

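Consider computationally tractable limits. We call a curve $\alpha = \tilde\eta_\theta(\beta)$ in
the $\beta$-$\alpha$ plane a Computationally Tractable Upper Bound (CTUB) if for any
fixed $(\theta, \alpha, \beta)$ such that $\alpha < \tilde\eta_\theta(\beta)$, there is a computationally tractable
clustering method $\hat\ell$ such that $\mathrm{Hamm}_p(\hat\ell, \alpha, \beta, \theta) \to 0$. A CTUB $\alpha = \tilde\eta_\theta(\beta)$ is
tight if for any computationally tractable method $\hat\ell$ and any fixed $(\theta, \alpha, \beta)$
such that $\alpha > \tilde\eta_\theta(\beta)$, $\liminf_{p \to \infty} \mathrm{Hamm}_p(\hat\ell, \alpha, \beta, \theta) \ge 1/2$. In this case, we
call $\alpha = \tilde\eta_\theta(\beta)$ the Computationally Tractable Boundary (CTB). Define

$\tilde\eta_\theta^{clu}(\beta) = \begin{cases} (1 - 2\beta)/2, & \beta < (1 - \theta)/2, \\ (1 + \theta - 2\beta)/4, & (1 - \theta)/2 < \beta < 1/2, \\ \theta/4, & 1/2 < \beta < 1 - \theta/2, \\ (1 - \beta)/2, & 1 - \theta/2 < \beta < 1. \end{cases}$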

5 The “lower bound” refers to the information lower bound as in the literature, not the 
lower bound for the curves in Figure 1 (say). Same for the “upper bound”. 

6 $\lceil x \rceil$ denotes the smallest integer that is no smaller than $x$.




Fig 1. Left: the statistical limits (red) and the CTUB (green) for clustering ($s = p\epsilon_p$ is the
expected number of signals). Right: phase transition of IF-PCA. White region: successful
clustering is possible but successful feature selection is impossible (using column-wise $\chi^2$
scores). Yellow region: both successful clustering and feature selection are possible.

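Theorem 1.3 (A CTUB for clustering). Fix $(\theta, \beta) \in (0, 1)^2$ and $\alpha > 0$ such
that $\alpha < \tilde\eta_\theta^{clu}(\beta)$, and consider the clustering problem for Models (1.1)-(1.2)
and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Hamm}_p(\hat\ell_*^{(sa)}, \alpha, \beta, \theta) \to 0$, if $0 < \beta < (1 - \theta)/2$.

• $\mathrm{Hamm}_p(\hat\ell_*^{(if)}, \alpha, \beta, \theta) \to 0$, if $(1 - \theta)/2 < \beta < 1/2$.

• $\mathrm{Hamm}_p(\hat\ell_q^{(if)}, \alpha, \beta, \theta) \to 0$, if $1/2 < \beta < 1$ and we take $q > 3$.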

We now discuss CTB. We discuss the cases (a) $0 < \beta < (1 - \theta)/2$, (b)
$(1 - \theta)/2 < \beta < 1/2$, (c) $1/2 < \beta < 1 - \theta/2$, and (d) $1 - \theta/2 < \beta < 1$
separately. Note that the CTB is sandwiched by two curves $\alpha = \eta_\theta^{clu}(\beta)$ and
$\alpha = \tilde\eta_\theta^{clu}(\beta)$. In (a) and (d), $\eta_\theta^{clu}(\beta) = \tilde\eta_\theta^{clu}(\beta)$, so our CTUB (i.e., CTUB
given in Theorem 1.3) is tight. For (b), we are not sure but we conjecture that
our CTUB is tight.$^7$ For (c), we have good reasons to believe that our CTUB
is tight. In fact, our model is intimately connected to the spike model [32];
see Section 1.8. The tightness of our CTUB under the spike model has been
well-studied (e.g., [9, 37]). Translating their results$^8$ to our setting suggests
that there is a small constant $\delta > 0$ such that when $1/2 < \beta < 1/2 + \delta$,
our CTUB is tight. Note that for (c), the CTUB $\alpha = \tilde\eta_\theta^{clu}(\beta)$ is flat. By the
monotonicity of CTB (see below), our CTUB is tight for (c). See Figure 1.
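
The claim that the two curves coincide in cases (a) and (d) and differ in cases (b) and (c) can be checked numerically. The following plain-Python sketch assumes the piecewise forms of $\eta_\theta^{clu}$ and $\tilde\eta_\theta^{clu}$ given in this section; the function names and the value $\theta = 0.4$ are hypothetical choices for illustration.

```python
def eta_clu(beta, theta):
    """Statistical limit for clustering."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 1 - theta:
        return theta / 2
    return (1 - beta) / 2


def eta_clu_tilde(beta, theta):
    """CTUB for clustering (Theorem 1.3)."""
    if beta < (1 - theta) / 2:
        return (1 - 2 * beta) / 2
    if beta < 0.5:
        return (1 + theta - 2 * beta) / 4
    if beta < 1 - theta / 2:
        return theta / 4
    return (1 - beta) / 2


def gap(beta, theta):
    """Vertical gap between the statistical limit and the CTUB."""
    return eta_clu(beta, theta) - eta_clu_tilde(beta, theta)


theta = 0.4  # illustrative value
for beta in (0.2, 0.4, 0.7, 0.9):  # one point in each of cases (a)-(d)
    print(beta, gap(beta, theta))
```

A zero gap means the CTUB already matches the statistical limit, so no computationally tractable method can do better there; a positive gap marks the region where tightness is a separate question.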

7 We know that CTB crosses two points $(\beta, \alpha) = (1/2, \theta/4)$ and $(\beta, \alpha) = ((1 - \theta)/2, \theta/2)$.
A natural guess is that the CTB in this part is a line segment connecting the two points.

8 Consider the hypothesis testing in the spike model. [9] proves that, with the “planted
clique” conjecture, for $n < p$ and $s = o(\sqrt{p})$, if $\|\mu\|_0 = s$ and $\|\mu\|^2 \le s\sqrt{\log(p)/n}$, there is
no polynomial-time test that is powerful. In ARW, since $\|\mu\|^2 \approx s\tau^2$, the above translates
to (ignoring the logarithmic factor) $\alpha > \theta/4 = \tilde\eta_\theta^{clu}(\beta)$.

Remark (Monotonicity of CTB). We show the CTB is monotone in $\beta$
(with $\theta$ fixed). Fix $\delta > 0$ and consider a new experiment, where for each
column of the data matrix, we keep the column with probability $p^{-\delta}$ and
replace it with an independent column drawn from $N(0, I_n)$ with probability
$1 - p^{-\delta}$. Compare this with the original experiment. The parameters $(\alpha, \theta)$
are the same, but $\beta$ has become $(\beta + \delta)$. The second experiment is harder,
for it is the result of the original experiment by sub-sampling the columns.
This shows that the CTB is monotone in $\beta$. The monotonicity now follows
by Le Cam’s results on comparison of experiments [34].
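
The column sub-sampling used in this remark can be written down directly. The sketch below (Python, assuming NumPy) builds the second experiment from a data matrix of the first; the function name `subsample_columns` and the toy input are assumptions made for this example.

```python
import numpy as np


def subsample_columns(X, delta, rng):
    """Keep each column of X with probability p^(-delta); replace every other
    column by an independent N(0, I_n) column."""
    n, p = X.shape
    keep = rng.random(p) < p ** (-delta)
    noise = rng.standard_normal((n, p))
    # Broadcast the column mask over the n rows.
    return np.where(keep[None, :], X, noise), keep


# Toy input: a constant matrix makes the kept columns easy to recognize.
rng = np.random.default_rng(0)
X0 = np.full((5, 400), 7.0)
Y, keep = subsample_columns(X0, delta=0.5, rng=rng)
print(int(keep.sum()), Y.shape)
```

Each useful feature survives with probability $p^{-\delta}$, so the fraction of useful features drops from $p^{-\beta}$ to $p^{-(\beta+\delta)}$ while $n$ and $\tau_p$ are unchanged, which is the sense in which $\beta$ becomes $\beta + \delta$.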

1.4. Phase transition for IF-PCA. IF-PCA is a flexible clustering method 
that is easy to use and computationally efficient. In [31], we developed a tuning
free version of IF-PCA using Higher Criticism [19, 21, 27] and applied it
to 10 microarray data sets with satisfactory results. The success of IF-PCA 
in real data analysis motivates us to investigate the method in depth. To 
facilitate delicate analysis, we consider the version of IF-PCA in Section 1.1, 
and reveal an interesting phase transition. 

To this end, we investigate a very challenging case (not covered in Theorems
1.1-1.3) where $(\alpha, \beta)$ fall exactly on the CTUB in Theorem 1.3:

(1.10)  $\alpha = \tilde\eta_\theta^{clu}(\beta).^9$

Also, note that a key step in IF-PCA is the column-wise $\chi^2$-screening. In
our model, a column $x_j$ is either distributed as $N(0, I_n)$ or $N(\tau_p \ell, I_n)$, where
$\tau_p = p^{-\alpha}$. For the $\chi^2$-screening to be non-trivial, we further require that

(1.11)  $1/2 < \beta < 1 - \theta/2.$

For $\beta$ in this range, the curve $\alpha = \tilde\eta_\theta^{clu}(\beta)$ is flat, i.e., $\tilde\eta_\theta^{clu}(\beta) = \theta/4$, and
so $\tau_p = p^{-\theta/4} = n^{-1/4}$. For $\beta$ outside this range, (1.10) dictates that either
$\tau_p \ll n^{-1/4}$ (so that the signals are so weak that the $\chi^2$-screening is bound to
fail) or $\tau_p \gg n^{-1/4}$ (so that the signals are so strong that the $\chi^2$-screening
is relatively trivial). See Figure 1.

We now restrict our attention to (1.10)-(1.11), where we recall that $\tau_p =
p^{-\theta/4}$. To make the case more interesting, we adjust the calibration of $\tau_p$
slightly by an $O(\log^{1/4}(p))$ factor:

(1.12)  $\tau_p^* = p^{-\theta/4} (4r\log(p))^{1/4}$, where $0 < r < 1$ is a fixed parameter.

With this calibration, the $\chi^2$-screening could be successful but non-trivial.
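
A short first-moment calculation may help explain why this calibration puts the screening statistic on the scale of its threshold. The sketch below assumes only Model (1.2) for a useful feature, written as $x_j = \tau\ell + z$ with $z \sim N(0, I_n)$, the statistic $Q(j)$ from (1.5), and $n = p^\theta$; it is a heuristic computation and not the argument used in the proofs.

```latex
\begin{align*}
E\|x_j\|^2 &= \sum_{i=1}^n E(\tau \ell_i + z_i)^2 = n(1 + \tau^2), \\
E[Q(j)] &= \frac{E\|x_j\|^2 - n}{\sqrt{2n}} = \sqrt{n/2}\,\tau^2, \\
\tau = \tau_p^* = n^{-1/4}(4r\log(p))^{1/4}
  \;&\Longrightarrow\;
  E[Q(j)] = \sqrt{n/2}\cdot\frac{2\sqrt{r\log(p)}}{\sqrt{n}} = \sqrt{2r\log(p)}.
\end{align*}
```

Under this heuristic, a useful feature has mean score $\sqrt{2r\log(p)}$ while the screening threshold is $\sqrt{2q\log(p)}$, so $r$ and $q$ live on the same scale.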


9 The case $\alpha < \tilde\eta_\theta^{clu}(\beta)$ is comparably easier to study, and the case $\alpha > \tilde\eta_\theta^{clu}(\beta)$ belongs
to the Region of Impossibility for computationally tractable methods; see our conjectures.



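Introduce the standard phase function$^{10}$ [19, 20]

(1.13)  $\rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1. \end{cases}$

Define the phase function for IF-PCA by

(1.14)  $\rho_\theta^*(\beta) = (1 - \theta) \cdot \rho^*\Big(\frac{1}{2} + \frac{\beta - 1/2}{1 - \theta}\Big), \qquad 1/2 < \beta < 1 - \theta/2.$

For any two vectors $x$ and $y$ in $\mathbb{R}^n$, let $\cos(x, y) = |(x/\|x\|, y/\|y\|)|$.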

Theorem 1.4 (Phase transition for IF-PCA). Fix $(\theta, \beta, \alpha, r) \in (0, 1)^4$ and
$q > 0$ such that (1.10)-(1.11) hold. Consider IF-PCA $\hat\ell_q^{(if)}$ for Models (1.1)-(1.2)
and (1.7)-(1.8), where $\tau_p$ is replaced by the new calibration $\tau_p^*$ in (1.12),
and let $\hat\xi^{(q)}$ be the leading left singular vector as in (1.6). As $p \to \infty$,

• If $r > \rho_\theta^*(\beta)$, then with probability at least $1 - o(p^{-2})$, $\cos(\hat\xi^{(q^*)}, \ell) \to 1$
with $q^* = (\beta - \theta/2 + r)^2/(4r)$ for $r > (\beta - \theta/2)/3$ and $q^* = 4r$ otherwise.

• If $r < \rho_\theta^*(\beta)$, then with probability at least $1 - o(n^{-1})$, there is a constant
$c_0 \in (0, 1)$ such that $\cos(\hat\xi^{(q)}, \ell) \le c_0$ for any fixed $0 < q < 1$.
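
To see how the phase function and the threshold parameter $q^*$ behave numerically, one may use the following plain-Python sketch. It takes the argument of $\rho^*$ in (1.14) to be $1/2 + (\beta - 1/2)/(1 - \theta)$, which is an assumption of this sketch; the function names and the parameter values in the usage lines are hypothetical.

```python
import math


def rho_star(beta):
    """Standard phase function (1.13), defined for 1/2 < beta < 1."""
    if not 0.5 < beta < 1:
        raise ValueError("beta must lie in (1/2, 1)")
    if beta <= 0.75:
        return beta - 0.5
    return (1 - math.sqrt(1 - beta)) ** 2


def rho_star_theta(beta, theta):
    """Phase function for IF-PCA, assuming the rescaled argument
    1/2 + (beta - 1/2) / (1 - theta) in (1.14)."""
    return (1 - theta) * rho_star(0.5 + (beta - 0.5) / (1 - theta))


def q_star(beta, theta, r):
    """Threshold parameter q* from the first bullet of Theorem 1.4."""
    b = beta - theta / 2
    return (b + r) ** 2 / (4 * r) if r > b / 3 else 4 * r


# Illustrative values with 1/2 < beta < 1 - theta/2.
beta, theta = 0.6, 0.5
print(rho_star_theta(beta, theta), q_star(beta, theta, 0.2), q_star(beta, theta, 0.1))
```

Under this reading, the two expressions for $q^*$ agree at $r = (\beta - \theta/2)/3$, and the two branches of $\rho^*$ agree at $\beta = 3/4$, so both functions are continuous.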

Theorem 1.4 is proved in Section 2, using delicate spectral analysis on the
post-selection data matrix (hence the term post-selection Random Matrix
Theory (RMT)). Compared to many works on RMT where the data matrix
has independent entries [45], the entries of the post-selection data matrix
are correlated in a complicated way, so the required analysis is more delicate. We
conjecture that when $r < \rho_\theta^*(\beta)$, $\cos(\hat\xi^{(q)}, \ell) \to 0$ for any fixed $0 < q < 1$. For
now, we can only show this for $q$ in a certain range; see the proof for details.

Figure 1 (right) displays the phase diagram for IF-PCA. For fixed $(\alpha, \beta)$ in
the interior of the white region, successful feature selection is impossible (by
column-wise $\chi^2$-screening) but successful clustering is possible. This shows
that feature selection and clustering are related but different problems. 

Remark. For the IF-PCA considered here, we use column-wise $\chi^2$-tests
for screening, which is computationally inexpensive. Alternatively, we may
use some regularization methods for screening (e.g., [15, 36, 50]). However, 
these methods are computationally more expensive, need tuning parameters 
that are hard to set, and are designed for feature selection, not clustering. 
For these reasons, it is unclear whether such alternatives may really help. 

10 It was introduced in the literature to study the phase transitions of multiple testing
and classification with rare/weak signals.

1.5. Clustering when the noise is colored. Consider a new version of
ARW where $(\ell, \mu)$ are the same as in Models (1.1), (1.7)-(1.8), but Model
(1.2) is replaced by a colored noise model

(1.15)  $X = \ell\mu' + AZB, \qquad Z_i(j) \overset{iid}{\sim} N(0, 1), \qquad 1 \le i \le n,\ 1 \le j \le p,$

where $A$ and $B$ are two non-random matrices.

Definition 1.1. We use $L_p > 0$ to denote a generic multi-$\log(p)$ term which
may vary from occurrence to occurrence such that for any fixed $\delta > 0$,
$L_p p^{-\delta} \to 0$ and $L_p p^{\delta} \to \infty$, as $p \to \infty$.

Theorem 1.5 (Statistical lower bound for clustering with colored noise).
Consider the ARW model (1.1)-(1.2) and (1.7)-(1.8). Theorem 1.1 continues
to hold if we replace the model (1.2) by (1.15) where $\max\{\|A\|, \|A^{-1}\|\} \le L_p$
and $\max\{\|B\|, \|B^{-1}\|\} \le L_p$.

Theorem 1.5 is proved in Section 5, using Le Cam’s comparison of experiments
[34]. The idea is to construct a new experiment that is easy to analyze
and such that the current one can be viewed as the result of adding noise to it.
Since “adding noise always makes the inference harder”, analyzing the new 
experiment provides a lower bound we need for the current experiment. The 
idea has been used in Hall and Jin [24], but for very different settings. 

Consider the case $A = I_n$. In this case, the matrix $AZB$ has independent
rows (but the columns may be correlated and heteroscedastic), and all four
methods we proposed earlier continue to work, except that in IF-PCA we
need $q > 3\max\{\mathrm{diag}(B'B)\}$. The following theorem is proved in Section 3.

Theorem 1.6 (Upper bounds for clustering with colored noise). Consider
the ARW model (1.1)-(1.2) and (1.7)-(1.8). Theorems 1.2-1.3 continue to
hold if we replace the model (1.2) by (1.15) with $A = I_n$ and $B$ such that
$\max\{\|B\|, \|B^{-1}\|\} \le L_p$ and that all diagonals of $B'B$ are upper bounded by
a constant $c > 0$, where we set $q > 3c$ in IF-PCA.

Practically, it is desirable to have a method that does not depend on the 
unknown parameter $c$. One way to attack this is to replace the column-wise
$\chi^2$-test by a plug-in $\chi^2$-test where we estimate the variance column-wise by
Median Absolute Deviation (say). However, such methods usually involve 
statistics of higher order moments; see [5] for discussions along this line. 

1.6. Limits for signal recovery and hypothesis testing. For a more complete
picture, we study the limits for signal recovery and hypothesis testing.

The goal of signal recovery is to recover the support of $\mu$. For any feature
selector $\hat{S}$, we measure the error by the (normalized) Hamming distance

$\mathrm{Hamm}_p(\hat{S}, \alpha, \beta, \theta) = (p\epsilon_p)^{-1} \sum_{j=1}^p \big[ P(\mu(j) = 0,\ j \in \hat{S}) + P(\mu(j) \ne 0,\ j \notin \hat{S}) \big],$

where $p\epsilon_p$ is the expected number of signals. Define

$\eta_\theta^{sig}(\beta) = \begin{cases} \theta/2, & \beta < (1 - \theta), \\ (1 + \theta - \beta)/4, & \beta > (1 - \theta), \end{cases}$

and

$\tilde\eta_\theta^{sig}(\beta) = \begin{cases} \theta/2, & \beta < (1 - \theta)/2, \\ (1 + \theta - 2\beta)/4, & (1 - \theta)/2 < \beta < 1/2, \\ \theta/4, & \beta > 1/2. \end{cases}$

The curve $\alpha = \eta_\theta^{sig}(\beta)$ can be viewed as the counterpart of $\alpha = \eta_\theta^{clu}(\beta)$, which
divides the two-dimensional phase space into the Region of Impossibility
and Region of Possibility. For any fixed $(\beta, \alpha)$ in the former and any $\hat{S}$,
$\mathrm{Hamm}_p(\hat{S}, \alpha, \beta, \theta) \gtrsim 1$. For any fixed $(\beta, \alpha)$ in the latter, there is an $\hat{S}$ such
that $\mathrm{Hamm}_p(\hat{S}, \alpha, \beta, \theta) \to 0$. The curve $\alpha = \tilde\eta_\theta^{sig}(\beta)$ can be viewed as the
counterpart of $\alpha = \tilde\eta_\theta^{clu}(\beta)$ and provides a CTUB for the signal recovery
problem. See Section 3 for more discussion.
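
As a concrete companion to these definitions, the following plain-Python sketch computes a sample version of the normalized Hamming distance for support recovery (false positives plus false negatives, divided by the expected number of signals) and evaluates the two curves. The function names are hypothetical, and the sample version replaces the probabilities in the definition by indicator counts from a single realization.

```python
def support_recovery_error(selected, support, expected_signals):
    """Sample version of the normalized Hamming distance for a selector."""
    selected, support = set(selected), set(support)
    false_positives = len(selected - support)   # mu(j) = 0 but j selected
    false_negatives = len(support - selected)   # mu(j) != 0 but j missed
    return (false_positives + false_negatives) / expected_signals


def eta_sig(beta, theta):
    """Statistical limit for signal recovery."""
    return theta / 2 if beta < 1 - theta else (1 + theta - beta) / 4


def eta_sig_tilde(beta, theta):
    """CTUB for signal recovery."""
    if beta < (1 - theta) / 2:
        return theta / 2
    if beta < 0.5:
        return (1 + theta - 2 * beta) / 4
    return theta / 4


print(support_recovery_error([1, 2, 3], [2, 3, 4], expected_signals=3))
print(eta_sig(0.8, 0.4), eta_sig_tilde(0.4, 0.4))
```

Because the normalization is the expected number of signals rather than $p$, a selector that returns the empty set has error close to 1, which is the benchmark against which the Region of Impossibility is stated.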

The goal of (global) hypothesis testing is to test a null hypothesis $H_0^{(p)}$
that the data matrix $X$ has iid entries from $N(0, 1)$ against an alternative
hypothesis $H_1^{(p)}$ that $X$ is generated according to Model (1.2). Define

(1.16)  $\eta_\theta^{hyp}(\beta) = \max\{\eta_\theta^{hyp,1}(\beta),\ \eta_\theta^{hyp,2}(\beta)\}, \qquad \tilde\eta_\theta^{hyp}(\beta) = \max\{\eta_\theta^{hyp,1}(\beta),\ \theta/4\},$

and $\eta_\theta^{hyp,1}(\beta) = (2 + \theta - 4\beta)/4$, $\eta_\theta^{hyp,2}(\beta) = \min\{\theta/2, (1 + \theta - \beta)/4\}$. Similarly, the
curve $\alpha = \eta_\theta^{hyp}(\beta)$ divides the two-dimensional phase space into the Region
of Impossibility and Region of Possibility. For any fixed $(\beta, \alpha)$ in the former, the sum
of Type I and Type II errors $\gtrsim 1$ for any testing procedure. For any fixed $(\beta, \alpha)$
in the latter, there is a test such that the sum of Type I and Type II errors
tends to 0. Also, the curve $\alpha = \tilde\eta_\theta^{hyp}(\beta)$ provides a CTUB for the hypothesis
testing problem. See Section 4 for more discussion.

The statistical limits for hypothesis testing here are different from those
in Arias-Castro and Verzelen [5]. For the less sparse case ($\beta < \theta/2$), the
signal strength needed in our model is weaker, because all signals have the
same sign. More interestingly, we find a phase transition phenomenon that
is not seen in [5]: when $\theta < 2/3$, there are three segments for the statistical
limits; when $\theta > 2/3$, there are only two segments.$^{11}$
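
The change from three segments to two at $\theta = 2/3$ can be verified numerically from (1.16). The plain-Python sketch below records, on a grid of $\beta$ values, which linear piece attains the maximum and counts the distinct pieces; the function names, the piece labels, and the grid size are assumptions of this illustration.

```python
def eta_hyp_pieces(beta, theta):
    """The three linear pieces entering eta^{hyp} in (1.16)."""
    return {
        "simple": (2 + theta - 4 * beta) / 4,   # eta^{hyp,1}
        "flat": theta / 2,                      # first branch of eta^{hyp,2}
        "decay": (1 + theta - beta) / 4,        # second branch of eta^{hyp,2}
    }


def eta_hyp(beta, theta):
    """Statistical limit for hypothesis testing."""
    pieces = eta_hyp_pieces(beta, theta)
    return max(pieces["simple"], min(pieces["flat"], pieces["decay"]))


def active_piece(beta, theta):
    """Name of the linear piece that attains eta^{hyp} at beta."""
    pieces = eta_hyp_pieces(beta, theta)
    sparse = "flat" if pieces["flat"] <= pieces["decay"] else "decay"
    return "simple" if pieces["simple"] >= pieces[sparse] else sparse


def count_segments(theta, grid=1000):
    """Number of distinct linear pieces active for beta in (0, 1)."""
    betas = [(i + 0.5) / grid for i in range(grid)]
    return len({active_piece(b, theta) for b in betas})


print(count_segments(0.4), count_segments(0.8))
```

For $\theta > 2/3$ the line $\eta_\theta^{hyp,1}$ stays above the flat level $\theta/2$ until after $\beta = 1 - \theta$, so the flat piece is never active and only two segments remain.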

11 The curve $\alpha = \eta_\theta^{hyp}(\beta)$ is the maximum of the boundary achievable by Simple Aggregation
(a line segment) and that by Sparse Aggregation (two line segments). Depending
on where two boundaries cross each other, $\alpha = \eta_\theta^{hyp}(\beta)$ may consist of 2 or 3 line segments.

The tightness of CTUB for signal recovery and hypothesis testing can be
addressed similarly to that for clustering. For signal recovery, the CTUB
is tight in the less sparse case ($0 < \beta < (1 - \theta)/2$) for it matches the
statistical limits; we have good reasons to believe it is tight in the sparse
case ($1/2 < \beta < 1$), due to results in [9, 37]; we are not sure for the moderately
sparse case ($(1 - \theta)/2 < \beta < 1/2$). For hypothesis testing, we have similar
arguments except that the cases of “less sparse” and “moderately sparse” refer
to that of $0 < \beta < (2 - \theta)/4$ and that of $(2 - \theta)/4 < \beta < 1/2$, respectively.

Figure 2 compares the limits for all three problems: clustering, signal 
recovery, and hypothesis testing. See details therein. 

Remark. Consider an extension of ARW where (1.7) is replaced by a
more complicated signal configuration: $\mu(j) \overset{iid}{\sim} (1 - \epsilon)\nu_0 + a\epsilon\nu_{-\tau} + (1 - a)\epsilon\nu_{\tau}$,
where $0 \le a \le 1/2$ is a constant ($a = 0$: original ARW). When $0 < a < 1/2$,
our results on statistical limits and CTUB for all three problems continue
to hold, provided with a slight change in the definition of the Hamming
distance for signal recovery. The case of $a = 1/2$ is more delicate, but the
changes in statistical limits (compared to the case of $a = 0$) can be explained
with Figure 2 (top left): (a) the black curve (signal recovery) remains the
same, (b) the red curve (clustering) remains the same, except that the segment
on the left is replaced by $\tau^4 = p/(ns^2)$, (c) for the blue curve (hypothesis
testing), the rightmost segment remains the same, while the other two
segments coincide with those of the red curve. The CTUBs also change
correspondingly. See Appendix D for a more detailed discussion.

1.7. Practical relevance and a real data example. The relatively idealized 
model we use allows very delicate analysis, but also raises practical concerns. 
In this section, we investigate IF-PCA with a real data example and illustrate 
that many ideas in previous sections are relevant in much broader settings. 


Table 2

The clustering errors for Leukemia data with different numbers of selected features. Rows
highlighted correspond to the threshold choices that yield lowest clustering errors.

#{selected features}  Errors    #{selected features}  Errors    #{selected features}  Errors
1                     34        1419                  3         2847                  5
347                   8         1776                  1         3204                  7
704                   6         2133                  1         3561                  11
1062                  5         2490                  1


We use the leukemia data set on gene microarrays. This data set was 
cleaned by Dettling [17], consisting of p = 3571 measured genes for n = 72 
samples from two classes: 47 from ALL (acute lymphoblastic leukemia), 
and 25 from AML (acute myeloid leukemia). The data set is available at 
www.stat.cmu.edu/~jiashun/Research/software/GenomicsData/ALL.

Fig 2. Top left: statistical limits for clustering (red), signal recovery (black), and hypothesis
testing (blue); $s = p\epsilon_p$. Other three panels: CTUB for clustering (top right), signal recovery
(bottom left) and hypothesis testing (bottom right), respectively (the three statistical limits
in the top left panel are also shown for comparison).


To implement IF-PCA, one noteworthy difficulty is the heteroscedasticity
across genes in the data set. We apply IF-PCA with small modifications.
In detail, arrange the data matrix as $X = [x_1, \ldots, x_p]$ as before. Let
$\bar{x}(j) = (1/n)\sum_{i=1}^n x_j(i)$, $m(x_j) = \mathrm{median}(x_j)$, and let
$d(j) = \mathrm{median}\{|x_j(1) - m(x_j)|, \ldots, |x_j(n) - m(x_j)|\}$ be the Median Absolute Deviation (MAD). We
normalize by $x_j^*(i) = 0.6745 \cdot (x_j(i) - \bar{x}(j))/d(j)$, $1 \le i \le n$, $1 \le j \le p$.$^{12}$ For
$q > 0$ to be determined, we select feature $j$ if and only if $(2n)^{-1/2}\,\big|\,\|x_j^*\|^2 - n\,\big| >
\sqrt{2q\log(p)}$. We then obtain the leading left singular vector $(\hat\xi^*)^{(q)}$ of the
post-selection data matrix (the selected columns of $[x_1^*, \ldots, x_p^*]$) and cluster by applying the standard
$k$-means algorithm to the leading eigenvector. In the last step, we can also
cluster by the sign vector of $(\hat\xi^*)^{(q)}$, and the results are similar. The $k$-means
algorithm has a slightly better performance.
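
The modified procedure can be sketched as follows in Python (assuming NumPy). This is an illustration under stated assumptions rather than the code used for the Leukemia data: the function name `if_pca_mad` is hypothetical, the final step uses the sign vector instead of $k$-means, the screening statistic is scaled by $(2n)^{-1/2}$, and the synthetic heteroscedastic data stand in for the real data set.

```python
import numpy as np


def if_pca_mad(X, q):
    """IF-PCA with column-wise centering and MAD scaling before screening."""
    n, p = X.shape
    center = X.mean(axis=0)
    mad = np.median(np.abs(X - np.median(X, axis=0)), axis=0)
    X_star = 0.6745 * (X - center) / mad
    # Two-sided screening statistic on the normalized columns.
    stat = np.abs(np.sum(X_star ** 2, axis=0) - n) / np.sqrt(2 * n)
    selected = np.flatnonzero(stat > np.sqrt(2 * q * np.log(p)))
    if selected.size == 0:
        raise ValueError("no feature passes the threshold; try a smaller q")
    U, _, _ = np.linalg.svd(X_star[:, selected], full_matrices=False)
    return np.sign(U[:, 0]), selected


# Synthetic stand-in data (hypothetical values): unequal column scales and offsets.
rng = np.random.default_rng(2)
n, p, n_signals, tau = 60, 500, 60, 1.5
labels = rng.choice([-1, 1], size=n)
mu = np.zeros(p)
mu[:n_signals] = tau
scales = rng.uniform(0.5, 2.0, size=p)
offsets = rng.normal(0.0, 3.0, size=p)
X = (np.outer(labels, mu) + rng.standard_normal((n, p))) * scales + offsets

est, selected = if_pca_mad(X, q=0.1)
error = min(np.mean(est != labels), np.mean(est != -labels))
print(selected.size, error)
```

After MAD scaling, a useful feature tends to have $\|x_j^*\|^2$ below $n$ rather than above it, because the class separation inflates the MAD; this is one reason a two-sided statistic is natural here.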

Table 2 displays the clustering errors for different numbers of selected
features (each corresponds to a choice of $q$). The table suggests that IF-PCA
works nicely, with an error rate as low as 1/72, if $q$ is set appropriately.

12 The value 0.6745 is such that $E[(x_j^*(i))^2] = 1$ when $x_j(i) \overset{iid}{\sim} N(0, \sigma^2)$ for any $\sigma > 0$.

Fig 3. Leading left singular vector of the data matrix $X$ with very few features selected by
FDR choice (left; 931 features chosen), with ideal number of features selected (middle; 2133
features chosen), and without feature selection (right). y-axis: entries of the left singular
vector, x-axis: sample indices. Plots are based on Leukemia data, where red and green dots
represent samples from the two classes ALL and AML, respectively.

Table 3

Comparison of clustering errors (Leukemia data). Columns 2-7: numerator is the
number of clustering errors, and denominator is the number of subjects.

Method      k-means  k-means++  Hierarchical  SpectralGem  Sparse k-means  IF-PCA
Error Rate  20/72    18.5/72    20/72         21/72        20/72           1/72

Figure 3 compares $(\hat\xi^*)^{(q)}$ for three choices of $q$: (a) the $q$ determined by
applying the FDR controlling procedure [8] with the FDR parameter of .05
and simulated $P$-values under the null $x_j \overset{iid}{\sim} N(0, I_n)$, (b) the $q$ associated
with the ideal number of selected features (see Table 2), and (c) the $q$ corresponding
to classical PCA (any $q$ that allows us to skip the feature selection
step works). This suggests that IF-PCA works well if $q$ is properly set.$^{13}$

We compare IF-PCA with classical methods of $k$-means and hierarchical
clustering [25], $k$-means++ (a recent revision of the classical $k$-means; [6]),$^{14}$
SpectralGem (classical PCA applied to $X^*$; [35]), and sparse $k$-means (a
modification of $k$-means with sparse feature weights in the objective; [49]).
The error rates are in Table 3, suggesting IF-PCA is effective in this case.

1.8. Comparison to works on the spike model. In our model (1.1)-(1.2),
if we replace the Bernoulli model for $\ell_i$ in (1.1) by a Gaussian model where
$\ell_i \overset{iid}{\sim} N(0, \sigma^2)$, then it becomes the spike model (Johnstone and Lu [32]).

13 A hard problem is how to set $q$ in a data-driven fashion. This is addressed in [28].

14 For $k$-means, we use the built-in Matlab package (parameter ‘replicates’ equals 30).
For $k$-means++, we run the program 30 times, and compute the average clustering errors.

In the spike model, while $\ell_i$'s are also of interest, the feature vector $\mu$
captures most of the attention: most recent works on the spike model (e.g.,
[4, 36, 47]) have been focused on signal recovery (and especially, sparse 
PCA). The two problems, signal recovery and clustering, are different. There 
are parameter settings where successful clustering is possible but successful 
signal recovery is impossible, and there are settings where the opposite is 
true; see Sections 1.4 and 1.6. Therefore, a direct extension of sparse PCA 
methods to clustering does not always work well. 

Our work is also different from existing works on the spike model in terms 
of motivation and validation. Our model is motivated by cancer (subject) 
clustering, where the class labels can be conveniently validated in many 
applications (e.g., see Section 1.7). In contrast, it is not easy to find real 
data sets where the feature vector $\mu$ is known, so it is comparatively harder
to validate the methods/theory on signal recovery or sparse PCA. Given 
the growing awareness of reproducibility and replicability [23], it becomes
increasingly important to develop methods and theory that can be directly
validated by real applications. In a sense, our model extends the spike
model to a new direction, and it helps strengthen (we hope) the ties between 
the recent theoretical interests on the spike model with real applications. 

1.9. Content and notations. Section 2 studies the phase transition of
IF-PCA, where we prove Theorem 1.4. Section 3 studies the statistical limits
for signal recovery, where we prove Theorems 1.2, 1.3, 1.6, as well as
Theorems 3.2-3.3 (to be introduced). Section 4 studies the statistical limits for
hypothesis testing, where we prove Theorems 4.2-4.3 (to be introduced).
Section 5 studies the lower bounds for all three problems and proves
Theorems 1.1 and 1.5, as well as Theorems 3.1 and 4.1 (to be introduced). Other
proofs are in the Appendix. Section 6 is for discussion.

In this paper, $L_p > 0$ denotes a generic multi-$\log(p)$ term; see Section
1.5. When $\xi$ is a vector, $\|\xi\|_q$ denotes the vector $L^q$-norm, $0 \le q \le \infty$ (the
subscript is dropped for simplicity if $q = 2$). When $\xi$ is a matrix, $\|\xi\|$ denotes
the matrix spectral norm, and $\|\xi\|_F$ denotes the matrix Frobenius norm. For
two vectors $\xi, \eta$, $(\xi, \eta)$ denotes their inner product, and
$\cos(\xi, \eta) = |(\xi/\|\xi\|, \eta/\|\eta\|)|$. For any two probability densities $f$ and $g$,
$\|f - g\|_1$ and $H(f, g)$ are the $L^1$-distance and the Hellinger distance, respectively.
For any real value $a$, $\lceil a \rceil$ is the smallest integer that is no smaller than $a$. We say
two positive sequences $a_n \sim b_n$, $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ if
$\lim_{n\to\infty} a_n/b_n = 1$, $\limsup_{n\to\infty} a_n/b_n \le 1$ and
$\liminf_{n\to\infty} a_n/b_n \ge 1$, respectively. For two sets
$A, B$, $A \triangle B = (A \setminus B) \cup (B \setminus A)$.
2. Phase transition for IF-PCA. In this section, we prove Theorem 1.4.
Our proofs need a very precise characterization of the spectra of the
post-selection Gram matrix $X^{(q)}(X^{(q)})'$. Specifically, we need both a tight
upper bound on the range of the spectra of $X^{(q)}(X^{(q)})'$ (Lemma 2.1) and
a tight lower bound for the largest eigenvalue of $X^{(q)}(X^{(q)})'$ (Lemma 2.2).
The main challenges are that, due to feature selection, 

• the entries of $X^{(q)}$ are no longer independent,

• the conditional distribution of each survived column is unclear. 

For this reason, existing results on RMT do not apply directly and we need 
to develop new theory on post-selection RMT. Our analysis adapts that in 
Vershynin [45] and uses the results of covering number in Rogers [40]. 

Remark. For the spike model, there are results about the spectra of a
different post-selection Gram matrix $(X^{(q)})'X^{(q)}$ (e.g., Theorem 2 of [32]). Since
feature selection is column-wise, the leading eigenvectors of $(X^{(q)})'X^{(q)}$ and
$X^{(q)}(X^{(q)})'$ have very different behaviors. Moreover, the settings of [32,
Theorem 2] implicitly force $X^{(q)}$ to have many more rows than columns (which
we call the “skinny” case), but our results do not have such a restriction.

To show the claim, it suffices to show the claim for any fixed realization
of $(\ell, \mu)$ in the event

$D_p = \{\mu : \big| |S(\mu)| - p\epsilon_p \big| \le \sqrt{6 p\epsilon_p \log(p)}\};$

note that $P(D_p^c) = O(p^{-3})$, so the complement of this event only has a
negligible effect. Fix $0 < q < 1$ and a realization of $(\ell, \mu)$ in $D_p$. Let
$\hat S_q^{(if)}(\ell, \mu)$ be the set of all survived features. In our model,
$X = \ell\mu' + Z$, and $Z = [z_1, z_2, \ldots, z_p]$. Introduce
a vector $\mu^{(q)} = \mu^{(q)}(\ell, \mu) \in R^p$ and a matrix
$Z^{(q)} = [z_1^{(q)}, \ldots, z_p^{(q)}] \in R^{n,p}$ by

$\mu^{(q)}(j) = \mu(j) \cdot 1\{j \in \hat S_q^{(if)}(\ell,\mu)\}, \quad z_j^{(q)} = z_j \cdot 1\{j \in \hat S_q^{(if)}(\ell,\mu)\}, \quad 1 \le j \le p,$

and so the post-selection data matrix $X^{(q)} = X^{(q)}(\ell, \mu)$, viewed as an $n \times p$
matrix with many zero columns, satisfies

$X^{(q)}(\ell,\mu) = \ell\,(\mu^{(q)}(\ell,\mu))' + Z^{(q)}(\ell,\mu).$
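The construction of the post-selection matrix can be illustrated with a small simulation. The following is a minimal sketch, not the authors' code: it assumes the column-wise statistic $Q(j) = (2n)^{-1/2}(\|x_j\|^2 - n)$ with cutoff $\sqrt{2q\log(p)}$ used elsewhere in the paper, and the function names (`simulate`, `select_columns`, `first_left_singular_vector`, `cosine`) are hypothetical. The leading left singular vector is obtained by plain power iteration on the Gram matrix of the kept columns.

```python
import math
import random

def simulate(n, p, support, tau, rng):
    """Draw labels ell in {-1,+1}^n and X = ell mu' + Z with iid N(0,1) noise."""
    ell = [rng.choice((-1, 1)) for _ in range(n)]
    mu = [tau if j in support else 0.0 for j in range(p)]
    X = [[ell[i] * mu[j] + rng.gauss(0.0, 1.0) for j in range(p)] for i in range(n)]
    return ell, X

def select_columns(X, q):
    """Keep column j when Q(j) = (||x_j||^2 - n)/sqrt(2n) exceeds sqrt(2 q log p)."""
    n, p = len(X), len(X[0])
    cut = math.sqrt(2.0 * q * math.log(p))
    kept = []
    for j in range(p):
        sq = sum(X[i][j] ** 2 for i in range(n))
        if (sq - n) / math.sqrt(2.0 * n) > cut:
            kept.append(j)
    return kept

def first_left_singular_vector(X, kept, rng, iters=200):
    """Power iteration on the n x n Gram matrix built from the kept columns only."""
    if not kept:
        raise ValueError("no column survived the selection")
    n = len(X)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(X[i][j] * v[i] for i in range(n)) for j in kept]   # kept columns' inner products with v
        v = [sum(X[i][j] * w[k] for k, j in enumerate(kept)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return v

def cosine(u, v):
    """|<u, v>| / (||u|| ||v||), the cos(.,.) of the notation section."""
    num = abs(sum(a * b for a, b in zip(u, v)))
    return num / math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
```

With a strong signal on a small support, most useful columns survive and the leading vector aligns with the labels; the parameter values in any such experiment are illustrative choices, not values from the paper.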

Fixing $(\beta, \theta, r) \in (0,1)^3$ and $q > 0$, and assuming $z \sim N(0, I_n)$, introduce
$m_0^{(q)}(\mu) = (p - |S(\mu)|) \cdot P(\|z\|^2 > n + 2\sqrt{qn\log(p)})$,
$m_1^{(q)}(\ell,\mu) = |S(\mu)| \cdot P(\|z + \tau_p^*\ell\|^2 > n + 2\sqrt{qn\log(p)})$, and
$m^{(q)}(\ell,\mu) = m_0^{(q)}(\mu) + m_1^{(q)}(\ell,\mu)$.
Note that $m_0^{(q)}(\mu)$ and $m_1^{(q)}(\ell,\mu)$ are the expected numbers of survived
useless/useful features, respectively. We also need the following counterpart
of $m^{(q)}(\ell,\mu)$:

$m_*^{(q)}(\ell,\mu) = (p - |S(\mu)|) \cdot n^{-1} E\big(\|z\|^2 1\{\|z\|^2 > n + 2\sqrt{qn\log(p)}\}\big)$

$\qquad + |S(\mu)| \cdot n^{-1} E\big(\|z\|^2 1\{\|z + \tau_p^*\ell\|^2 > n + 2\sqrt{qn\log(p)}\}\big).$

The dependence on $(\ell, \mu)$ is tedious, so for notational simplicity, we may
drop it without further notice.

The term $m^{(q)}$ is the expected number of selected features, and plays
an important role. By tail properties of chi-square distributions (see
Section B.1), with probability $1 - O(p^{-3})$,

(2.1) $m_*^{(q)} \sim m^{(q)} \sim L_p\big[p^{1-q} + p\epsilon_p\, p^{-[(\sqrt q - \sqrt r)_+]^2}\big],$

where as before $L_p$ is a generic multi-$\log(p)$ term. Recalling $n = p^\theta$, define

$\tilde q(\beta,\theta,r) = \begin{cases} \max\{1-\theta,\ (\sqrt{1-\beta-\theta} + \sqrt r)^2\}, & \beta < 1-\theta, \\ 1-\theta, & \beta \ge 1-\theta. \end{cases}$

By (2.1) and basic algebra, it is seen that there are two different cases:

• (“Fat”). When $q < \tilde q(\beta,\theta,r)$, $m^{(q)}/n \to \infty$ and $X^{(q)}$ has many more
columns than rows.

• (“Skinny”). When $q > \tilde q(\beta,\theta,r)$, $m^{(q)}/n \to 0$ and $X^{(q)}$ has many more
rows than columns.
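The "basic algebra" behind the two cases can be checked numerically. The sketch below is an illustration under stated assumptions: it reads the boundary as $\max\{1-\theta, (\sqrt{1-\beta-\theta}+\sqrt r)^2\}$ when $\beta < 1-\theta$ and $1-\theta$ otherwise, ignores all multi-$\log(p)$ factors, and compares the exponent of $p$ in (2.1) with the exponent $\theta$ of $n = p^\theta$. The function names are hypothetical.

```python
import math

def q_tilde(beta, theta, r):
    """Boundary between the 'fat' and 'skinny' regimes, as read from the text."""
    if beta < 1.0 - theta:
        return max(1.0 - theta, (math.sqrt(1.0 - beta - theta) + math.sqrt(r)) ** 2)
    return 1.0 - theta

def survivor_exponent(q, beta, r):
    """Exponent of p in m^(q) per (2.1): max{1 - q, 1 - beta - [(sqrt q - sqrt r)_+]^2}."""
    gap = max(math.sqrt(q) - math.sqrt(r), 0.0)
    return max(1.0 - q, 1.0 - beta - gap ** 2)

def shape(q, beta, theta, r):
    """'fat' if q is below the boundary, 'skinny' if above, 'boundary' if equal."""
    qt = q_tilde(beta, theta, r)
    if q < qt:
        return "fat"
    if q > qt:
        return "skinny"
    return "boundary"
```

In this reading, "fat" is exactly the case where the survivor exponent exceeds $\theta$, that is, where $m^{(q)}/n$ grows polynomially in $p$.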

Lemma 2.1 (Upper bound for the range of eigenvalues of $Z^{(q)}(Z^{(q)})'$).
Suppose conditions of Theorem 1.4 hold. There exists a universal constant $C > 0$
such that for any fixed $q > 0$, as $p \to \infty$, conditioning on any realization of
$(\ell,\mu)$ from the event $D_p$, with probability at least $1 - O(p^{-3})$,

• (“Fat” case). When $q < \tilde q(\beta,\theta,r)$, all eigenvalues of $Z^{(q)}(Z^{(q)})'$ fall
between $m_*^{(q)} \pm [C\sqrt{n m^{(q)} \log(p)} + o(m_1^{(q)})]$.

• (“Skinny” case). When $q > \tilde q(\beta,\theta,r)$, all nonzero eigenvalues of $Z^{(q)}(Z^{(q)})'$
fall between $n \pm C\sqrt{n m^{(q)} \log(p)}$.

Remark. Noting that $m^{(q)}$ is the expected number of columns of $Z^{(q)}$, our
results are very similar to the well-known results on eigenvalues of RMT in
the case where we have an $n \times m^{(q)}$ matrix with iid $N(0,1)$ entries. However,
we need more sophisticated proofs, as the rows of $Z^{(q)}$ are dependent and
the distribution of the columns of $Z^{(q)}$ is unknown and hard to characterize.

For the “fat” case, it turns out that Lemma 2.1 is insufficient: we need both 
an improved upper bound on the range (with the y/\og(p) factor eliminated) 
and a lower bound on the leading eigenvalue. 
Lemma 2.2 (Improved bound (“fat” case)). Suppose the conditions of
Theorem 1.4 hold and $r < \rho_\theta^*(\beta)$. There exist constants $c_1 > c_2 > 0$ such that as
$p \to \infty$, for any fixed $q > 0$, conditioning on any realization of $(\ell,\mu)$ from
the event $D_p$, with probability $1 - O(n^{-2})$,

• All singular values of $Z^{(q)}(Z^{(q)})'$ fall between $m_*^{(q)} \pm c_1\sqrt{n m^{(q)}}$;

• $\lambda_{\max}(Z^{(q)}(Z^{(q)})') \ge m_*^{(q)} + c_2\sqrt{n m^{(q)}}$.

We now prove Theorem 1.4. We show the cases of $r > \rho_\theta^*(\beta)$ (Region of
Possibility) and $r < \rho_\theta^*(\beta)$ (Region of Impossibility) separately.


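2.1. Region of Possibility. Consider the case $r > \rho_\theta^*(\beta)$. Recall that

$q = q^*(\beta,\theta,r) = \begin{cases} 4r, & r < (\beta - \theta/2)/3, \\ \dfrac{(\beta - \theta/2 + r)^2}{4r}, & (\beta - \theta/2)/3 \le r < 1. \end{cases}$

Let $\hat\xi^*$ be the first left singular vector of $X^{(q)}$ at $q = q^*(\beta,\theta,r)$. The goal is
to show

$\cos(\ell, \hat\xi^*) \to 1.$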

Write

(2.2) $X^{(q)}(X^{(q)})' = \|\mu^{(q)}\|^2\,\ell\ell' + Z^{(q)}(Z^{(q)})' + \Delta,$

where $\Delta = \ell(\mu^{(q)})'(Z^{(q)})' + Z^{(q)}\mu^{(q)}\ell'$ for short. On the right hand side of
(2.2), the first matrix has rank 1, with $n\|\mu^{(q)}\|^2$ being the only nonzero
eigenvalue and $\ell$ being the associated eigenvector. In our model, the
expectation of $\|\mu^{(q)}\|^2$ is equal to $(\tau_p^*)^2 m_1^{(q)}$, where by tail properties of
chi-square distributions (see Section B.1),
$m_1^{(q)} = L_p p^{1-\beta} p^{-[(\sqrt q - \sqrt r)_+]^2}$ with
overwhelming probabilities. It follows that with a probability at least
$1 - O(p^{-3})$,

$n\|\mu^{(q)}\|^2 \gtrsim n(\tau_p^*)^2 \cdot m_1^{(q)} = L_p p^{A(q,\beta,\theta,r)},$

where $A(q,\beta,\theta,r) = 1 + \theta/2 - \beta - [(\sqrt q - \sqrt r)_+]^2$. Compare this with (2.2). By
perturbation theory in matrices$^{15}$ [11, 16], to show the claim, it suffices to
show that there is a scalar $a^*$ (either random or non-random) and a constant
$\delta^* > 0$ so that$^{16}$

$\|Z^{(q)}(Z^{(q)})' + \Delta - a^* I_n\| \le L_p p^{A(q,\beta,\theta,r) - \delta^*}.$

15 We use [11, Proposition 1], a variant of the sine-theta theorem [16]. By that
proposition, if $\xi$ and $\hat\xi$ are the respective leading eigenvectors of two symmetric
matrices $G$ and $\hat G$, where $G$ has rank 1, then
$\|\xi\xi' - \hat\xi\hat\xi'\| \le 2\|G\|^{-1}\|\hat G - G\|$. We also note that for two
unit-norm vectors $\xi$ and $\hat\xi$, $\cos(\xi,\hat\xi) \to 1$ if and only if
$\|\xi\xi' - \hat\xi\hat\xi'\| \to 0$ by linear algebra.

16 We have used the fact that adding/subtracting a multiple of the identity matrix does 
not affect the eigenvectors. 
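The two facts in footnotes 15 and 16 can be checked on a small example. The sketch below is illustrative only: the 3 x 3 matrices are made-up numbers, the helper names are hypothetical, eigenpairs are approximated by power iteration, and the spectral norm of $\xi\xi' - \hat\xi\hat\xi'$ is computed through the identity that, for unit vectors, it equals the sine of the angle between them.

```python
import math

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    s = math.sqrt(dot(v, v))
    return [a / s for a in v]

def leading_eigvec(M, iters=2000):
    """Power iteration; assumes the dominant eigenvalue of symmetric M is positive."""
    v = unit([1.0 + 0.37 * i for i in range(len(M))])
    for _ in range(iters):
        v = unit(matvec(M, v))
    return v

def spectral_norm(M, iters=2000):
    """Spectral norm of a symmetric M via power iteration on M*M."""
    v = unit([1.0 + 0.37 * i for i in range(len(M))])
    for _ in range(iters):
        v = unit(matvec(M, matvec(M, v)))
    Mv = matvec(M, v)
    return math.sqrt(dot(Mv, Mv))

def sin_angle(u, v):
    """For unit vectors, ||uu' - vv'|| (spectral norm) equals sin of their angle."""
    return math.sqrt(max(0.0, 1.0 - dot(u, v) ** 2))

# Rank-one G = lam * xi xi' and a symmetric perturbation E (illustrative numbers).
xi = unit([1.0, 2.0, 2.0])
lam = 5.0
G = [[lam * a * b for b in xi] for a in xi]
E = [[0.2, -0.1, 0.3], [-0.1, 0.0, 0.25], [0.3, 0.25, -0.4]]
G_hat = [[G[i][j] + E[i][j] for j in range(3)] for i in range(3)]

xi_hat = leading_eigvec(G_hat)
lhs = sin_angle(xi, xi_hat)              # ||xi xi' - xi_hat xi_hat'||
rhs = 2.0 * spectral_norm(E) / lam       # 2 ||G||^{-1} ||G_hat - G||

# Footnote 16: adding a multiple of the identity leaves the eigenvectors unchanged.
shifted = [[G_hat[i][j] + (3.0 if i == j else 0.0) for j in range(3)] for i in range(3)]
xi_shifted = leading_eigvec(shifted)
```

Here `lhs` should not exceed `rhs`, and `xi_shifted` should coincide with `xi_hat` up to sign.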
To this end, note that by triangle inequality,

$\|Z^{(q)}(Z^{(q)})' + \Delta - a^* I_n\| \le \|Z^{(q)}(Z^{(q)})' - a^* I_n\| + \|\Delta\|.$

The following lemma is proved in the Appendix.

Lemma 2.3 Suppose conditions of Theorem 1.4 hold. For any fixed $q > 0$,
as $p \to \infty$, conditioning on any realization of $(\ell,\mu)$ from the event $D_p$, with
probability $1 - O(p^{-3})$,
$\|\ell(\mu^{(q)})'(Z^{(q)})' + Z^{(q)}\mu^{(q)}\ell'\| \le C n \tau_p^* \sqrt{m_1^{(q)}}$.

The key to the proof is to control $\|Z^{(q)}\mu^{(q)}\|_\infty$ using the Bernstein inequality
[41] and to study the distribution of $Z^{(q)}$. See [29] for details.

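Now, when $q > \tilde q(\beta,\theta,r)$, we are in the “skinny” case. Combining Lemmas
2.1 and 2.3, we have that with probability at least $1 - O(p^{-3})$,

$\|Z^{(q)}(Z^{(q)})' + \Delta\| \le n + C\big(n\tau_p^*\sqrt{m_1^{(q)}} + \sqrt{n m^{(q)} \log(p)}\big) \le L_p p^{[\theta + A(q,\beta,\theta,r)]/2}.$

In the last inequality, we have used (2.1), which indicates that $m_0^{(q)} = L_p p^{1-q}$
and $m_1^{(q)} = L_p p^{1-\beta-[(\sqrt q - \sqrt r)_+]^2}$. By the condition of $r > \rho_\theta^*(\beta)$, it can be
shown that $A(q,\beta,\theta,r) > \theta$, and the claim follows by letting $a^* = 0$ and
$\delta^* = [A(q,\beta,\theta,r) - \theta]/2$. When $q < \tilde q(\beta,\theta,r)$, we are in the “fat” case. Combining Lemmas
2.1 and 2.3, with probability at least $1 - O(p^{-3})$,

$\|Z^{(q)}(Z^{(q)})' + \Delta - m_*^{(q)} I_n\| \le C\big[n\tau_p^*\sqrt{m_1^{(q)}} + \sqrt{n m^{(q)} \log(p)}\big] + o(m_1^{(q)})$

$\qquad \le L_p p^{\theta/2 + \frac12\max\{A(q,\beta,\theta,r),\, 1-q\}} + p^{A(q,\beta,\theta,r) - \theta/2}.$

By the condition of $r > \rho_\theta^*(\beta)$, it can be shown that
$A(q,\beta,\theta,r) > \max\{\theta, \frac{\theta+1-q}{2}\}$,
and the claim follows by letting $a^* = m_*^{(q)}$ and
$\delta^* = \min\{\frac{A-\theta}{2},\ A - \frac{1-q+\theta}{2},\ \frac{\theta}{2}\}$.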

2.2. Region of Impossibility. Consider the case $r < \rho_\theta^*(\beta)$. Fix $0 < q < 1$.
Recall that $\hat\xi^{(q)}$ is the first left singular vector of $X^{(q)}$. The goal is to show that

(2.3) $\cos(\ell, \hat\xi^{(q)}) \le c_0 < 1$, for any $0 < q < 1$,

where $c_0$ is a universal constant independent of $q$. Denote for short
$H = X^{(q)}(X^{(q)})'$, $H_0 = Z^{(q)}(Z^{(q)})'$, $\hat\xi = \hat\xi^{(q)}$ and $\tilde\ell = \ell/\|\ell\|$. Let the eigenvalues
of $H$ be $\lambda_1(H) \ge \lambda_2(H) \ge \ldots \ge \lambda_n(H)$. Write

$\tilde\ell = a\hat\xi + \sqrt{1-a^2}\,\eta$, for a unit-norm vector $\eta$ such that $\eta \perp \hat\xi$.

Since $\hat\xi' H \eta = \lambda_1 \hat\xi'\eta = 0$ and $\eta' H \eta \ge \lambda_n$, we have
$\tilde\ell' H \tilde\ell = a^2 \hat\xi' H \hat\xi + 2a\sqrt{1-a^2}\,\hat\xi' H \eta + (1-a^2)\eta' H \eta \ge a^2\lambda_1 + (1-a^2)\lambda_n$.
Rearranging it gives
$a^2 \le 1 - [\lambda_1(H) - \tilde\ell' H \tilde\ell]/[\lambda_1(H) - \lambda_n(H)]$. Note that $\cos(\ell, \hat\xi) = |a|$. So to
show (2.3), it suffices to show that

(2.4) $\dfrac{\lambda_1(H) - \tilde\ell' H \tilde\ell}{\lambda_1(H) - \lambda_n(H)} \ge 1 - c_0$, for some constant $c_0 \in (0,1)$.
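The rearranged inequality holds for any symmetric matrix and any unit vector, so it can be checked directly. The sketch below is an illustration with hypothetical helper names: it builds a small Gram matrix from random Gaussian entries, approximates $\lambda_1$, $\lambda_n$ and the leading eigenvector by power iteration (the smallest eigenvalue via the shifted matrix $\lambda_1 I - H$), and returns both sides of $a^2 \le 1 - [\lambda_1 - \tilde\ell' H \tilde\ell]/[\lambda_1 - \lambda_n]$.

```python
import math
import random

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    s = math.sqrt(dot(v, v))
    return [a / s for a in v]

def top_eigpair(M, iters=20000):
    """Largest eigenvalue and eigenvector of a positive semi-definite M (power iteration)."""
    v = unit([1.0 + 0.37 * i for i in range(len(M))])
    for _ in range(iters):
        v = unit(matvec(M, v))
    return dot(v, matvec(M, v)), v

def alignment_bound(H, ell):
    """Return (a^2, 1 - (lam_1 - l'Hl)/(lam_1 - lam_n)) for l = ell/||ell||."""
    n = len(H)
    lam1, xi = top_eigpair(H)
    shifted = [[(lam1 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]
    lam_n = lam1 - top_eigpair(shifted)[0]
    ell_t = unit(ell)
    a = dot(ell_t, xi)
    quad = dot(ell_t, matvec(H, ell_t))
    return a * a, 1.0 - (lam1 - quad) / (lam1 - lam_n)

def gram(B):
    """H = B B' for a list-of-rows matrix B."""
    return [[dot(r, s) for s in B] for r in B]
```

The bound is informative only when $\tilde\ell' H \tilde\ell$ is well below $\lambda_1(H)$, which is what (2.4) asks for.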

The following lemma is proved in the Appendix. 

Lemma 2.4 Suppose $r < \rho_\theta^*(\beta)$ and the conditions of Theorem 1.4 hold. As
$p \to \infty$, for any fixed $q > 0$, conditioning on any realization of $(\ell,\mu)$ from
the event $D_p$, for any $v \in S^{n-1}$, with probability $1 - O(p^{-3})$,
$|v' H_0 v - m_*^{(q)}| \le C\sqrt{m^{(q)} \log(p)}$, and

$\|H - H_0\| = \begin{cases} o(n), & q > \tilde q(\beta,\theta,r) \text{ (“skinny” case)}, \\ o(\sqrt{n m^{(q)}}), & q < \tilde q(\beta,\theta,r) \text{ (“fat” case)}. \end{cases}$


We now show (2.4). Similarly, let $\lambda_1(H_0) \ge \lambda_2(H_0) \ge \ldots \ge \lambda_n(H_0)$ be
the eigenvalues of $H_0$. We treat the cases of $q > \tilde q(\beta,\theta,r)$ and
$q < \tilde q(\beta,\theta,r)$ separately. Consider the first case. This is the “skinny” case where
$m^{(q)} \ll n$. By Lemma 2.1 and the first claim of Lemma 2.4, with probability
$1 - O(p^{-3})$, $\lambda_1(H_0) \sim n$, $\lambda_n(H_0) \ge 0$ and $\tilde\ell' H_0 \tilde\ell = o(n)$. By the second
claim of Lemma 2.4, $\|H - H_0\| = o(n)$. Combining the above with Weyl’s
inequality [48] (i.e., $\max_{1 \le i \le n} |\lambda_i(H) - \lambda_i(H_0)| \le \|H - H_0\|$), we have

$\lambda_1(H) - \lambda_n(H) \le n + o(n), \qquad \lambda_1(H) - \tilde\ell' H \tilde\ell \ge n - o(n).$

Inserting these into (2.4) gives the claim.

Consider the second case. This is the “fat” case and $m^{(q)} \gg n$. By Lemma
2.2 and the first claim of Lemma 2.4, there is $\lambda_1(H_0) - \lambda_n(H_0) \le c_2\sqrt{n m^{(q)}}$
and $\lambda_1(H_0) - \tilde\ell' H_0 \tilde\ell \ge c_1\sqrt{n m^{(q)}}$. Similarly, combining these with the second
claim of Lemma 2.4 and Weyl’s inequality, we find that

$\lambda_1(H) - \lambda_n(H) \lesssim c_2\sqrt{n m^{(q)}}, \qquad \lambda_1(H) - \tilde\ell' H \tilde\ell \gtrsim c_1\sqrt{n m^{(q)}}.$

Inserting these into (2.4) gives the claim.


3. Limits for signal recovery. In this section, we discuss limits for
signal (support) recovery. The results are intertwined with those for
clustering (namely, Theorems 1.1-1.3 and Theorem 1.5), so we prove all of them
together in the later part of the section.

Compare two problems: signal recovery and clustering. One useful insight
is that in the less sparse case, clustering is comparably easier than signal
recovery, so we should estimate $\ell$ first and then use it to estimate $S(\mu)$; in
the more sparse case, we should do the opposite.

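For the less sparse case, we have introduced two clustering methods, $\hat\ell_*^{(sa)}$
and $\hat\ell_*^{(if)}$, in Section 1.1. They give rise to two signal recovery methods,
$\hat S_*^{(sa)}$ and $\hat S_*^{(if)}$. In detail, let $y_*^{(sa)} = n^{-1/2} X' \hat\ell_*^{(sa)}$ and
$y_*^{(if)} = n^{-1/2} X' \hat\ell_*^{(if)}$, and
let $t_p^* = \sqrt{2\log(p)}$ be the universal threshold [22]. Respectively, $\hat S_*^{(sa)}$ and
$\hat S_*^{(if)}$ are defined by

$\hat S_*^{(sa)} = \{1 \le j \le p : |y_*^{(sa)}(j)| > t_p^*\}, \qquad \hat S_*^{(if)} = \{1 \le j \le p : |y_*^{(if)}(j)| > t_p^*\}.$

For the more sparse case, we introduce two methods $\hat S_N^{(sa)}$ and $\hat S_q^{(if)}$: they
are in fact the ones that give rise to the clustering methods $\hat\ell_N^{(sa)}$ and $\hat\ell_q^{(if)}$ we
introduced in Section 1.1. In detail, recalling that $Q(j) = (2n)^{-1/2}(\|x_j\|^2 - n)$
is the column-wise $\chi^2$-statistic,

(3.1) $\hat S_N^{(sa)} = \mathrm{argmax}_{\{S : S \subset \{1,2,\ldots,p\},\, |S| = N\}} \big\{ N^{-1/2} \big\| \sum_{j \in S} x_j \big\|_1 \big\},$

and

$\hat S_q^{(if)} = \{1 \le j \le p : Q(j) > \sqrt{2q\log(p)}\}.$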
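The optimization in (3.1) ranges over all size-$N$ column subsets, so a direct search is feasible only for tiny $p$. The sketch below is a brute-force illustration with hypothetical function names, not a practical algorithm. It also checks the equivalent form used later for the sparse aggregation test: since $\|v\|_1 = \max_{\ell \in \{\pm 1\}^n} \ell' v$, the maximum in (3.1) equals the maximum over sign vectors $\ell$ of the sum of the $N$ largest entries of $X'\ell$, divided by $\sqrt N$.

```python
import itertools
import math
import random

def objective(X, S):
    """N^{-1/2} * || sum_{j in S} x_j ||_1 for a column subset S."""
    return sum(abs(sum(row[j] for j in S)) for row in X) / math.sqrt(len(S))

def sparse_aggregation_bruteforce(X, N):
    """Exhaustive search over all size-N subsets of columns (tiny p only)."""
    p = len(X[0])
    return max(itertools.combinations(range(p), N), key=lambda S: objective(X, S))

def max_over_sign_vectors(X, N):
    """Same maximum, searched over ell in {-1,+1}^n instead of over subsets."""
    n, p = len(X), len(X[0])
    best = -math.inf
    for signs in itertools.product((-1, 1), repeat=n):
        scores = sorted(
            (sum(signs[i] * X[i][j] for i in range(n)) for j in range(p)),
            reverse=True,
        )
        best = max(best, sum(scores[:N]) / math.sqrt(N))
    return best

def labels_from_subset(X, S):
    """Cluster labels induced by a subset: the signs of the aggregated rows."""
    return [1 if sum(row[j] for j in S) >= 0 else -1 for row in X]
```

Both searches are exponential (in $p$ and in $n$ respectively), which is consistent with the paper treating sparse aggregation separately from the computationally tractable methods.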

For any signal (support) recovery procedure $\hat S$, we measure the
performance by the normalized size of the difference of $\hat S$ and the true support

(3.2) $\mathrm{Hamm}_p(\hat S, \alpha, \beta, \theta) = (p\epsilon_p)^{-1} E(|\hat S \triangle S(\mu)|),$

where $A \triangle B = (A \setminus B) \cup (B \setminus A)$ denotes the symmetric difference of two
sets and the expectation is with respect to the randomness of $(\mu, \ell, Z)$.
If we think of $\hat S$ as an estimate of $\mu$, say, $\hat\mu$, then $E(|\hat S \triangle S(\mu)|)$ is actually the
Hamming distance between the two vectors $(\mathrm{sgn}(|\hat\mu(1)|), \ldots, \mathrm{sgn}(|\hat\mu(p)|))'$
and $(\mathrm{sgn}(\mu(1)), \ldots, \mathrm{sgn}(\mu(p)))'$. For this reason, we call the quantity in (3.2) the
(normalized) Hamming distance.
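As an illustration of the less sparse recipe (cluster first, then recover the support), the sketch below implements simple aggregation followed by thresholding at $\sqrt{2\log(p)}$, and evaluates the symmetric-difference error on one simulated data set. It is a sketch under assumptions: the function names are hypothetical, the simulation parameters are arbitrary, and the error is normalized by the realized support size for a single draw rather than by $p\epsilon_p$ in expectation as in (3.2).

```python
import math
import random

def simulate_two_class(n, p, support, tau, rng):
    """X = ell mu' + Z with mu(j) = tau on the support and iid N(0,1) noise."""
    ell = [rng.choice((-1, 1)) for _ in range(n)]
    X = [[ell[i] * (tau if j in support else 0.0) + rng.gauss(0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return ell, X

def simple_aggregation_support(X):
    """Estimate labels by the sign of row sums, then threshold y = n^{-1/2} X' ell_hat."""
    n, p = len(X), len(X[0])
    ell_hat = [1 if sum(row) >= 0 else -1 for row in X]
    t_star = math.sqrt(2.0 * math.log(p))          # universal threshold
    y = [sum(ell_hat[i] * X[i][j] for i in range(n)) / math.sqrt(n) for j in range(p)]
    return {j for j in range(p) if abs(y[j]) > t_star}, ell_hat

def normalized_hamming(S_hat, S_true, support_size):
    """|S_hat symmetric-difference S_true| divided by the support size."""
    return len(S_hat ^ S_true) / support_size
```

When the aggregated signal dominates the aggregated noise, the estimated labels are mostly correct and the thresholded statistic separates useful from useless features.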

In Section 1.6, we have introduced the curves $\alpha = \eta_\theta^{sig}(\beta)$ and $\alpha = \tilde\eta_\theta^{sig}(\beta)$.
The following theorem is proved in Section 5.

Theorem 3.1 (Statistical lower bound for signal recovery). Fix $(\alpha, \beta, \theta) \in (0,1)^3$
and suppose $\alpha > \eta_\theta^{sig}(\beta)$. Consider the signal recovery problem for
Models (1.1)-(1.2) and (1.7)-(1.8). For any $\hat S$ that is an estimate for the
support of $\mu$, $\mathrm{Hamm}_p(\hat S, \alpha, \beta, \theta) \gtrsim 1$ as $p \to \infty$.

We also have the following theorems, which are proved below. 

Theorem 3.2 (Statistical upper bound for signal recovery). Fix $(\alpha, \beta, \theta) \in (0,1)^3$
and suppose $\alpha < \eta_\theta^{sig}(\beta)$. Consider the signal recovery problem for
Models (1.1)-(1.2) and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Hamm}_p(\hat S_*^{(sa)}, \alpha, \beta, \theta) \to 0$, if $0 < \beta < (1-\theta)/2$.

• $\mathrm{Hamm}_p(\hat S_N^{(sa)}, \alpha, \beta, \theta) \to 0$, if $(1-\theta)/2 < \beta < 1$ and $N = \lceil p\epsilon_p \rceil$.

Theorem 3.3 (CTUB for signal recovery). Fix $(\alpha, \beta, \theta) \in (0,1)^3$ and
suppose $\alpha < \tilde\eta_\theta^{sig}(\beta)$. Consider the signal recovery problem for Models
(1.1)-(1.2) and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Hamm}_p(\hat S_*^{(sa)}, \alpha, \beta, \theta) \to 0$, if $0 < \beta < (1-\theta)/2$.

• $\mathrm{Hamm}_p(\hat S_*^{(if)}, \alpha, \beta, \theta) \to 0$, if $(1-\theta)/2 < \beta < 1/2$.

• $\mathrm{Hamm}_p(\hat S_q^{(if)}, \alpha, \beta, \theta) \to 0$, if $1/2 < \beta < 1$ and $q \ge 3$.

3.1. Proofs of Theorems 1.2-1.3, 1.6 and 3.2-3.3. We need two lemmas.
The first one is on classical PCA, and it is needed for studying $\hat\ell_*^{(if)}$ and $\hat S_*^{(if)}$.
The second one is a large-deviation inequality for folded normal random
variables and it is needed for studying the optimization problem in (3.1).

Lemma 3.1 Fix $(\alpha, \beta, \theta) \in (0,1)^3$ such that $(1-\theta)/2 < \beta < 1/2$ and
$\alpha < \tilde\eta_\theta^{clu}(\beta)$. In Models (1.1)-(1.2) and (1.7)-(1.8), let $\hat\lambda$ be the first eigenvalue
of $XX'$ and $\hat\xi$ be the corresponding eigenvector. There is a generic constant
$\delta = \delta(\alpha, \beta, \theta) > 0$ such that with probability $1 - O(p^{-3})$,

$\min\{\|\sqrt n\,\hat\xi + \ell\|_\infty,\ \|\sqrt n\,\hat\xi - \ell\|_\infty\} \le p^{-\delta}.$

The claim continues to hold if we replace the model (1.2) by (1.15) for
$A = I_n$ and $B$ such that $\max\{\|B\|, \|B^{-1}\|\} \le L_p$.

Lemma 3.2 (Large-deviation on Folded Normals). As $n \to \infty$, for any
$h > 0$ and $0 < x \le \sqrt n/\log(n)$, and $n$ independent samples $z_i$ from $N(0,1)$,

$P\big(\big|\sum_{i=1}^n (|z_i + h| - E[|z_i + h|])\big| > \sqrt n\,x\big) \le 2\exp(-(1 + o(1))x^2/2),$

where $o(1) \to 0$, uniformly for all $h > 0$ and $0 < x \le \sqrt n/\log(n)$.

We now show all theorems about upper bounds. Since Theorems 1.2-1.3 are
special cases of Theorem 1.6 with $B = I_p$, it suffices to show Theorems 1.6
and 3.2-3.3. As there are four methods involved, it is more convenient to
group the items associated with each method together.
Fixing $(\alpha, \beta, \theta) \in (0,1)^3$ and viewing all statements in Theorem 1.6 and
Theorems 3.2-3.3, what we need to show can be re-organized as follows (for
the statements regarding $\hat\ell$, we need to prove that they hold for a general $B$
where $\max\{\|B\|, \|B^{-1}\|\} \le L_p$).

• (a). Simple Aggregation. Consider the case $0 < \beta < (1-\theta)/2$. In
this range, $\eta_\theta^{sig}(\beta) < \eta_\theta^{clu}(\beta)$. All we need to show is that if
$\alpha < \eta_\theta^{clu}(\beta)$, then $\hat\ell_*^{(sa)} = \ell$ with probability at least $1 - O(p^{-3})$, and that
if additionally $\alpha < \eta_\theta^{sig}(\beta)$, then $\mathrm{Hamm}_p(\hat S_*^{(sa)}, \alpha, \beta, \theta) \to 0$.

• (b). Sparse Aggregation. Consider the case $(1-\theta)/2 < \beta < 1$. In this
case, $\eta_\theta^{clu}(\beta) < \eta_\theta^{sig}(\beta)$. Letting $N = \lceil p\epsilon_p \rceil$, all we need to show is that
$\mathrm{Hamm}_p(\hat S_N^{(sa)}, \alpha, \beta, \theta) \to 0$ if $\alpha < \eta_\theta^{sig}(\beta)$ and
$\mathrm{Hamm}_p(\hat\ell_N^{(sa)}, \alpha, \beta, \theta) \to 0$ if additionally $\alpha < \eta_\theta^{clu}(\beta)$.

• (c). Classical PCA. Consider the case $(1-\theta)/2 < \beta < 1/2$ where only
computationally tractable bounds are concerned and $\tilde\eta_\theta^{clu}(\beta) = \tilde\eta_\theta^{sig}(\beta)$.
All we need to show is that if $\alpha < \tilde\eta_\theta^{clu}(\beta)$, then $\hat\ell_*^{(if)} = \pm\ell$ with
probability at least $1 - O(p^{-3})$ and that $\mathrm{Hamm}_p(\hat S_*^{(if)}, \alpha, \beta, \theta) \to 0$.

• (d). IF-PCA. Consider the case $1/2 < \beta < 1$ where only
computationally tractable bounds are concerned and $\tilde\eta_\theta^{clu}(\beta) < \tilde\eta_\theta^{sig}(\beta)$. All we
need to show is that if $\alpha < \tilde\eta_\theta^{sig}(\beta)$, then $\hat S_q^{(if)} = S(\mu)$ with
probability at least $1 - O(p^{-3})$; and if additionally $\alpha < \tilde\eta_\theta^{clu}(\beta)$, then
$\mathrm{Hamm}_p(\hat\ell_q^{(if)}, \alpha, \beta, \theta) \to 0$.
Consider (a). Note that $\hat\ell_*^{(sa)} = \mathrm{sgn}(\sum_{j=1}^p x_j)$ and
$\sum_{j=1}^p x_j \sim N(\|\mu\|_0 \tau_p \ell, p I_n)$.
By (3.3), $\|\mu\|_0 \tau_p = p^{1-\beta-\alpha}(1 + o(1))$. Hence, $\alpha < \eta_\theta^{clu}(\beta)$ implies
$\|\mu\|_0 \tau_p \gg \sqrt p$,
and it follows that $\hat\ell_*^{(sa)} = \ell$ with overwhelming probability. Once $\hat\ell_*^{(sa)} = \ell$,
$y_*^{(sa)} = n^{-1/2} X'\ell \sim N(\sqrt n\,\mu, I_p)$. Noting that $\alpha < \eta_\theta^{sig}(\beta)$ implies
$\sqrt n\,\tau_p \gg 1$,
we have $\mathrm{Hamm}_p(\hat S_*^{(sa)}) \to 0$ with overwhelming probability. Consider (c).
The first claim is a direct result of Lemma 3.1, and the second claim can
be proved similarly as in (a). Consider (d). Recall that the column-wise test
statistic $Q(j)$ is approximately distributed as $N(0,1)$ for useless features and
$N(\sqrt{n/2}\,\tau_p^2, 1)$ for useful features. So $\tau_p \gg n^{-1/4}$ will assure successful signal
recovery, which translates to $\alpha < \tilde\eta_\theta^{sig}(\beta)$. Once $\hat S_q^{(if)} = S(\mu)$, we restrict our
attention to $X^{S(\mu)}$, the sub-matrix of $X$ restricted to the columns in $S(\mu)$,
and the claim of Lemma 3.1 continues to hold by adapting the proof there
(see the Appendix for details). So $\mathrm{Hamm}_p(\hat\ell_q^{(if)}) \to 0$ with overwhelming
probability. It remains to prove (b).

We now show (b). Define $\hat\mu_N^{(sa)}$ such that $\hat\mu_N^{(sa)}(j) = \tau_p \cdot 1\{j \in \hat S_N^{(sa)}\}$. Write
$\hat S_N^{(sa)} = \hat S$, $\hat\mu_N^{(sa)} = \hat\mu$, $\hat\ell_N^{(sa)} = \hat\ell$ and $s_p = p\epsilon_p$. With probability $1 - O(p^{-3})$,

(3.3) $\big| \|\mu\|_0 - s_p \big| \le C\sqrt{s_p \log(p)}.$

Since any event of probability $O(p^{-3})$ has a negligible effect on the Hamming
distances, we always condition on a fixed realization of $(\ell, \mu)$ that satisfies (3.3);
so the probabilities below are with respect to the randomness of $Z$. To
show (b), all we need to show are

• (b1). $\mathrm{Hamm}_p(\hat\ell, \alpha, \beta, \theta) \to 0$, if $\alpha < \eta_\theta^{clu}(\beta)$. In this item, the matrix
$B$ may be any matrix that satisfies $\max\{\|B\|, \|B^{-1}\|\} \le L_p$.

• (b2). $\mathrm{Hamm}_p(\hat S, \alpha, \beta, \theta) \to 0$, if $\alpha < \eta_\theta^{sig}(\beta)$. In this item, $B = I_p$.

Consider (b1) first. It suffices to show

(3.4) $n^{-1}(\hat\ell, \ell) \to 1.$

For any realized $\mu$, we construct $\tilde\mu$ as follows:

• If $\|\mu\|_0 > N$, replace $\|\mu\|_0 - N$ nonzero entries by 0.

• If $\|\mu\|_0 < N$, replace $N - \|\mu\|_0$ zero entries by $\tau_p$.

Let $\tilde S$ be the support of $\tilde\mu$. Write
$X\tilde\mu = \|B\tilde\mu\| \cdot [Z(B\tilde\mu/\|B\tilde\mu\|) + (\mu, \tilde\mu/\|B\tilde\mu\|)\ell]$,
where $Z(B\tilde\mu/\|B\tilde\mu\|) \sim N(0, I_n)$ and
$(\mu, \tilde\mu/\|B\tilde\mu\|) \gtrsim \|B\|^{-1}\tau_p\sqrt{s_p} = L_p p^{(1-\beta-2\alpha)/2}$,
with $(1 - \beta - 2\alpha) > 0$ in our range of interest. According to Mills’ ratio [41],
with probability $1 - O(p^{-3})$, the absolute value of a standard normal variable
is bounded by $\sqrt{6\log(p)}$, which is less than $L_p p^{(1-\beta-2\alpha)/2}$ when $p \to \infty$. It
follows that with probability at least $1 - O(p^{-3})$, $\mathrm{sgn}(X\tilde\mu) = \ell$. Furthermore,
$\ell' X\tilde\mu = \|X\tilde\mu\|_1 = \tau_p \|\sum_{j \in \tilde S} x_j\|_1$. Since
$\hat\ell' X\hat\mu = \|X\hat\mu\|_1 = \tau_p \|\sum_{j \in \hat S} x_j\|_1$
and $\hat S$ solves the optimization problem (3.1),

(3.5) $\hat\ell' X\hat\mu \ge \ell' X\tilde\mu.$

Write $\hat\ell' X\hat\mu = (\hat\ell, \ell)(\mu, \hat\mu) + \hat\ell' ZB\hat\mu$. We aim to obtain an upper bound for
$|\hat\ell' ZB\hat\mu|$ (an upper bound for $|\ell' ZB\tilde\mu|$ can be obtained similarly). Denote
by $(ZB)^{\hat S}$ the sub-matrix of $ZB$ containing columns in $\hat S$. Then
$|\hat\ell' ZB\hat\mu| \le \sqrt n\,\|(ZB)^{\hat S}\|\,\|\hat\mu\| \le \sqrt{n s_p}\,\tau_p \|(ZB)^{\hat S}\|$, where
$\|(ZB)^{\hat S}\| \le \|B\|\,\|Z^{\hat S}\| \le L_p \max_{|S| = N} \|Z^S\|$.
By classical RMT [45], $\max_{|S| = N} \|Z^S\| \le L_p \max\{\sqrt n, \sqrt{s_p}\}$ with probability
at least $1 - O(p^{-3})$. Inserting them into (3.5) gives

(3.6) $(\hat\ell, \ell)(\mu, \hat\mu) \ge n(\mu, \tilde\mu) - L_p\sqrt{n s_p}\,\tau_p(\sqrt n + \sqrt{s_p}).$

First, $(\mu, \hat\mu) \le \max\{\|\mu\|_0, N\}\tau_p^2 \sim s_p\tau_p^2$. Second, by (3.3) and the definition
of $\tilde\mu$, $(\mu, \tilde\mu) = s_p\tau_p^2(1 + o(1))$. Inserting these into (3.6) gives

$n^{-1}(\hat\ell, \ell) \ge 1 - L_p(\sqrt n + \sqrt{s_p})/(\tau_p\sqrt{n s_p}).$

When $\alpha < \eta_\theta^{clu}(\beta)$, the second term on the
right hand side is $\le p^{-\delta}$ for some $\delta = \delta(\alpha, \beta, \theta) > 0$, and (3.4) follows.

We now consider (b2). Let $\tilde\mu$ and $\tilde S$ be the same as above. Due to (3.3),
$|\tilde S \cap S(\mu)| \ge |S(\mu)|(1 + o(1))$. It suffices to show that

(3.7) $|\hat S \cap S(\mu)| \ge |\tilde S \cap S(\mu)| - o(s_p).$

Since $|\hat S| = |\tilde S| = N$ and $\hat S$ solves the optimization (3.1),

(3.8) $G(\hat S) \equiv \dfrac{1}{n\sqrt N} \sum_{i=1}^n \big| \sum_{j \in \hat S} x_i(j) \big| \ge \dfrac{1}{n\sqrt N} \sum_{i=1}^n \big| \sum_{j \in \tilde S} x_i(j) \big| \equiv G(\tilde S).$

For any $S \subset \{1, \ldots, p\}$ such that $|S| = N$, we define $w_i(S) = N^{-1/2} \sum_{j \in S} z_i(j)$
and $h(S) = N^{-1/2} |S \cap S(\mu)| \tau_p$. It follows that $G(S) = n^{-1} \sum_{i=1}^n |w_i(S) + h(S)|$,
where $w_i(S) \stackrel{iid}{\sim} N(0,1)$, $1 \le i \le n$. For any $h > 0$, we define the function
$u(h) = E_{x \sim N(0,1)}(|x + h|)$. Let $E_p$ be the event that
$\{\max_{S \subset \{1,\ldots,p\},\, |S| = N} |G(S) - u(h(S))| \le \sqrt{6N\log(p)/n}\}$.
By Lemma 3.2 and the fact that there are no
more than $p^N$ such $S$, $P(E_p^c) = O(p^{-3})$; so those realizations of $Z$ in $E_p^c$ have a
negligible effect. Combining it with (3.8) gives

(3.9) $u(h(\hat S)) \ge u(h(\tilde S)) - L_p\sqrt{s_p\log(p)/n}.$

The following lemma is proved in the Appendix. 

Lemma 3.3 There exists a constant $C > 0$ such that for any $0 \le h_1 < h_2$,
$u(h_2) - u(h_1) \ge C\min\{(h_2 - h_1), (h_2 - h_1)^2\}$.
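The folded-normal mean has a closed form, $u(h) = h\,\mathrm{erf}(h/\sqrt 2) + \sqrt{2/\pi}\,e^{-h^2/2}$, which makes the growth condition of Lemma 3.3 easy to probe numerically. The sketch below evaluates the ratio $[u(h_2) - u(h_1)]/\min\{h_2 - h_1, (h_2 - h_1)^2\}$ on a grid; the grid range and step are arbitrary choices, and a grid check is of course only a sanity check, not a proof.

```python
import math

def u(h):
    """E|x + h| for x ~ N(0,1): h*erf(h/sqrt(2)) + sqrt(2/pi)*exp(-h^2/2)."""
    return h * math.erf(h / math.sqrt(2.0)) + math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h)

def smallest_ratio(h_max=5.0, steps=50):
    """Minimum over grid pairs h1 < h2 of (u(h2) - u(h1)) / min(d, d^2), d = h2 - h1."""
    grid = [h_max * k / steps for k in range(steps + 1)]
    ratios = []
    for a in grid:
        for b in grid:
            if b > a:
                d = b - a
                ratios.append((u(b) - u(a)) / min(d, d * d))
    return min(ratios)
```

The quadratic regime reflects $u'(0) = 0$ (the function is flat at the origin), while the linear regime reflects $u'(h) = \mathrm{erf}(h/\sqrt 2) \to 1$ for large $h$.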

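Since $h(S) \le h(\tilde S)$ for any $S$ with $|S| = N$, by Lemma 3.3 (applied with
$h_1 = h(\hat S)$ and $h_2 = h(\tilde S)$),

(3.10) $u(h(\hat S)) \le u(h(\tilde S)) - C\min\{h(\tilde S) - h(\hat S),\ [h(\tilde S) - h(\hat S)]^2\}.$

We combine (3.9)-(3.10). Since $h(\tilde S) - h(\hat S) = N^{-1/2}\tau_p\,(|\tilde S \cap S(\mu)| - |\hat S \cap S(\mu)|)$,
it yields that

(3.11) $0 \le N^{-1}\big(|\tilde S \cap S(\mu)| - |\hat S \cap S(\mu)|\big) \le L_p \tau_p^{-1} \max\big\{(\log(p)/n)^{1/2},\ (\log(p)/(n s_p))^{1/4}\big\}.$

The assumption $\alpha < \eta_\theta^{sig}(\beta)$ implies $\tau_p \ge p^{\delta}\max\{n^{-1/2}, (n s_p)^{-1/4}\}$. So the
right hand side of (3.11) is $o(1)$. Then (3.7) follows.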

4. Limits for hypothesis testing. The goal for (global) hypothesis
testing is to test a null hypothesis

(4.1) $H_0^{(p)}$: $X_i \stackrel{iid}{\sim} N(0, I_p)$, $1 \le i \le n$,

against a specific alternative in the complement of the null,

(4.2) $H_1^{(p)}$: the $X_i$'s are generated from Models (1.1)-(1.2) and (1.7)-(1.8).

We consider three different tests.

The first test $T_*^{(sa)}$ is connected to the idea of simple aggregation. Recall
that $\bar x$ is the average of all columns. The idea is to test whether $E[\bar x] = 0$ or
not using the classical $\chi^2$. This test rejects $H_0^{(p)}$ if and only if

$(2n)^{-1/2}[p\|\bar x\|^2 - n] \ge 2\sqrt{2\log(p)}.$
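The simple aggregation test reduces the data to one vector in $R^n$ and one standardized chi-square statistic. A minimal sketch follows, assuming the rejection threshold reads $2\sqrt{2\log(p)}$ (the constant is taken from the rule above as best it can be read) and using hypothetical function names. Under the null, $p\|\bar x\|^2$ is a $\chi^2_n$ variable, so the statistic is approximately standard normal.

```python
import math
import random

def simple_aggregation_statistic(X):
    """(2n)^{-1/2} * (p * ||xbar||^2 - n), where xbar in R^n averages the p columns."""
    n, p = len(X), len(X[0])
    xbar = [sum(row) / p for row in X]
    return (p * sum(v * v for v in xbar) - n) / math.sqrt(2.0 * n)

def rejects_null(X):
    """Reject when the statistic reaches 2*sqrt(2 log p) (assumed threshold)."""
    p = len(X[0])
    return simple_aggregation_statistic(X) >= 2.0 * math.sqrt(2.0 * math.log(p))

def gaussian_matrix(n, p, rng, ell=None, mu=0.0):
    """Pure noise if ell is None; otherwise X = ell mu' + Z with mu(j) = mu for all j."""
    return [[(ell[i] * mu if ell else 0.0) + rng.gauss(0.0, 1.0) for _ in range(p)]
            for i in range(n)]
```

The dense alternative used in any quick experiment with `gaussian_matrix` (every feature useful) is the regime where aggregation over all columns is most effective; it is an illustrative choice, not a setting from the paper.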

The second test $T_N^{(sa)}$ is connected to sparse aggregation. Let $\hat S_N^{(sa)}$ be as in
(3.1). This test rejects $H_0^{(p)}$ if and only if

$N^{-1/2}\big\|\sum_{j \in \hat S_N^{(sa)}} x_j\big\|_1 \ge \sqrt{2/\pi}\,n + \sqrt{2n(N+2)\log(p)}.$

The third test $T^{(hc)}$ is connected to the Higher Criticism in Donoho and Jin
[19]. Recalling that $Q(j) = (2n)^{-1/2}(\|x_j\|^2 - n)$ are the column-wise $\chi^2$-tests,
the idea is to test whether some of the $Q(j)$'s have non-zero means.

• For $1 \le j \le p$, obtain a P-value $\pi_j = P\{(2n)^{-1/2}[\chi_n^2(0) - n] \ge Q(j)\}$.

• Sort the P-values in the ascending order: $\pi_{(1)} < \pi_{(2)} < \ldots < \pi_{(p)}$.

• Compute the Higher Criticism statistic $HC_p^* = \max_{\{1 \le i \le p/2\}} HC_{p,i}$,
where $HC_{p,i} = \sqrt p\,[(i/p) - \pi_{(i)}]/[\pi_{(i)}(1 - \pi_{(i)})]^{1/2}$.

The test rejects $H_0^{(p)}$ if and only if $HC_p^* \ge 2\sqrt{2\log\log(p)}$.
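The last two steps of the Higher Criticism test operate on P-values only, so they can be sketched independently of how the P-values are computed. The code below is a minimal sketch with hypothetical function names: it assumes the normalization $\sqrt p$ in $HC_{p,i}$ (the usual Higher Criticism form), takes a list of P-values as input, and skips any P-value equal to 0 or 1 to avoid dividing by zero.

```python
import math

def higher_criticism(pvalues):
    """max over 1 <= i <= p/2 of sqrt(p) * (i/p - pi_(i)) / sqrt(pi_(i) * (1 - pi_(i)))."""
    p = len(pvalues)
    ordered = sorted(pvalues)
    best = -math.inf
    for i in range(1, p // 2 + 1):
        pi_i = ordered[i - 1]
        if 0.0 < pi_i < 1.0:
            hc = math.sqrt(p) * (i / p - pi_i) / math.sqrt(pi_i * (1.0 - pi_i))
            best = max(best, hc)
    return best

def hc_rejects(pvalues):
    """Reject when HC* >= 2*sqrt(2 log log p), the rule stated above (needs p >= 3)."""
    p = len(pvalues)
    return higher_criticism(pvalues) >= 2.0 * math.sqrt(2.0 * math.log(math.log(p)))
```

For evenly spread P-values the statistic stays near zero, while a small cluster of very small P-values makes it large, which is the rare-and-weak situation the test is designed for.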

The test $T_*^{(sa)}$ is similar to a test in [5], which is designed for the case
that there is (unknown) dependence among features and so the test is more
complicated than ours. The other two tests are newly proposed.

For any testing procedure $T$ that tests $H_0^{(p)}$ against $H_1^{(p)}$, we measure the
performance by the sum of Type I and Type II errors:

(4.3) $\mathrm{Err}(T, \alpha, \beta, \theta) = P_{H_0^{(p)}}(T \text{ rejects } H_0^{(p)}) + P_{H_1^{(p)}}(T \text{ accepts } H_0^{(p)}),$

where the probabilities are with respect to the randomness of $(\mu, \ell, Z)$.

In Section 1.6, we have introduced two curves $\eta_\theta^{hyp}(\beta)$ and $\tilde\eta_\theta^{hyp}(\beta)$. The
following theorem is proved in Section 5.

Theorem 4.1 (Statistical lower bound for hypothesis testing). Fix $(\alpha, \beta, \theta) \in (0,1)^3$
with $\alpha > \eta_\theta^{hyp}(\beta)$. Consider the testing problem (4.1)-(4.2) for Models
(1.1)-(1.2) and (1.7)-(1.8). For any test $T$, $\mathrm{Err}(T, \alpha, \beta, \theta) \gtrsim 1$ as $p \to \infty$.

Consider the upper bound. By the definitions (see (1.16)), when $\alpha < \eta_\theta^{hyp}(\beta)$,
we have either $\alpha < \eta_\theta^{hyp,1}(\beta)$ or $\alpha < \eta_\theta^{hyp,2}(\beta)$, or both.
Theorem 4.2 (Statistical upper bound for hypothesis testing). Fix $(\alpha, \beta, \theta) \in (0,1)^3$
such that $\alpha < \eta_\theta^{hyp}(\beta)$. Consider the testing problem (4.1)-(4.2) for
Models (1.1)-(1.2) and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Err}(T_*^{(sa)}, \alpha, \beta, \theta) \to 0$ if $\alpha < \eta_\theta^{hyp,1}(\beta)$.

• $\mathrm{Err}(T_N^{(sa)}, \alpha, \beta, \theta) \to 0$ if $\alpha < \eta_\theta^{hyp,2}(\beta)$ and we take $N = \lceil p\epsilon_p \rceil$.

Theorem 4.3 (CTUB for hypothesis testing). Fix $(\alpha, \beta, \theta) \in (0,1)^3$ such
that $\alpha < \tilde\eta_\theta^{hyp}(\beta)$. Consider the testing problem (4.1)-(4.2) for Models
(1.1)-(1.2) and (1.7)-(1.8). As $p \to \infty$,

• $\mathrm{Err}(T_*^{(sa)}, \alpha, \beta, \theta) \to 0$ if $0 < \beta < 1/2$.

• $\mathrm{Err}(T^{(hc)}, \alpha, \beta, \theta) \to 0$ if $1/2 < \beta < 1$.

4.1. Proofs of Theorems 4.2-4.3. Similarly, as three tests are involved, it
is more convenient to prove the results by grouping the items associated
with each test separately. Fixing $(\alpha, \beta, \theta) \in (0,1)^3$ and viewing the two
theorems, the following is what we need to show.

• (a) (Simple Aggregation). When $\alpha < \eta_\theta^{hyp,1}(\beta)$, $\mathrm{Err}(T_*^{(sa)}, \alpha, \beta, \theta) \to 0$.

• (b) (Sparse Aggregation). When $\alpha < \eta_\theta^{hyp,2}(\beta)$, $\mathrm{Err}(T_N^{(sa)}, \alpha, \beta, \theta) \to 0$ if
we take $N = \lceil p\epsilon_p \rceil$.

• (c) (HC). When $1/2 < \beta < 1$ and $\alpha < \theta/4$, $\mathrm{Err}(T^{(hc)}, \alpha, \beta, \theta) \to 0$.

In the above, (c) is an easy extension of [19], so we omit its proof. Below,
we prove (a) and (b). Consider (a). $T_*^{(sa)}$ is defined through $\bar x$, where
$\bar x \sim N(p^{-1}\|\mu\|_0\tau_p\ell,\ p^{-1}I_n)$. So the claim follows directly from the tail probability
of chi-square distributions. Consider (b). Under $H_0^{(p)}$, for each fixed $S$ with
$|S| = N$, we can write $N^{-1/2}\|\sum_{j \in S} x_j\|_1 = \sum_{i=1}^n |w_i|$, where the $w_i$'s are iid
standard normal variables. Since $E(|w_i|) = \sqrt{2/\pi}$, by Lemma 3.2,
$T_N^{(sa)} \le \sqrt{2/\pi}\,n + \sqrt{2n(N+2)\log(p)}$ with probability $1 - O(p^{-2})$. On the other
hand, it is seen that

$T_N^{(sa)} = \max_{\ell \in \{\pm 1\}^n,\ \mu \in \{0,1\}^p,\ \|\mu\|_0 = N}\ \ell' X(\mu/\|\mu\|).$

Under $H_1^{(p)}$, let $\tilde\mu$ be defined in the same way as in Section 3.1, and so
$T_N^{(sa)} \ge \ell' X\tilde\mu/\|\tilde\mu\| = n(\mu, \tilde\mu)/\|\tilde\mu\| + \ell' Z\tilde\mu/\|\tilde\mu\|$. Since $|S(\mu)| \sim p\epsilon_p$ with probability
$1 - O(p^{-2})$, $(\mu, \tilde\mu)/\|\tilde\mu\| \ge \|\mu\|(1 + o(1))$. Moreover,
$|\ell' Z\tilde\mu| \le C\sqrt n\,\|\tilde\mu\|(\sqrt n + \sqrt N)$
with probability $1 - O(p^{-3})$, by classical RMT [45]. Combining the above
gives $T_N^{(sa)} \ge n\|\mu\| - C\sqrt n(\sqrt n + \sqrt{s_p}) \ge n\|\mu\|/2$, where the last inequality
is because $\alpha < \eta_\theta^{hyp,2}(\beta)$ implies $\tau_p \gg \max\{n^{-1/2}, s_p^{-1/2}\}$. Therefore,
$T_N^{(sa)} \ge n\tau_p\sqrt N/2 \gg \max\{n, \sqrt{nN\log(p)}\}$, and the claim follows.
5. Proofs of Theorems 1.1, 1.5, 3.1, and 4.1 (lower bounds).

5.1. Proof of Theorem 1.1. For each $1 \le i \le n$, consider the testing of
two hypotheses, $H_{-1}^{(i)}$: $\ell_i = -1$ versus $H_1^{(i)}$: $\ell_i = 1$. Let $f_\pm^{(i)}$ be the joint
density of $X$ under $H_{\pm 1}^{(i)}$, respectively. Since $\ell_i = \pm 1$ with equal probabilities,
it follows from the connection between $L^1$-distance and the sum of Type I
and Type II testing errors [44] that for any clustering procedure $\hat\ell$,
$P(\hat\ell_i \ne \ell_i) \ge 1 - \|f_+^{(i)} - f_-^{(i)}\|_1$. Comparing this with the desired claim, it suffices to
show that for all $1 \le i \le n$,

(5.1) $\|f_+^{(i)} - f_-^{(i)}\|_1 = o(1)$, where $o(1) \to 0$ and does not depend on $i$.

We now show (5.1) for every fixed $1 \le i \le n$. For short, we drop the
superscript “(i)” in $f_\pm^{(i)}$ and $H_{\pm 1}^{(i)}$. Recall that $X = \ell\mu' + Z$. Denote
$\tilde\ell = \ell - \ell_i e_i$, where $e_i$ is the $i$-th standard basis vector of $R^n$; note that $\tilde\ell_i = 0$.
By basic calculus and Fubini’s theorem,

$\|f_- - f_+\|_1 = E\big[\big| \int\!\!\int \sinh(X_i'\mu)\, e^{-\|\mu\|^2/2}\, e^{\tilde\ell' X\mu - (n-1)\|\mu\|^2/2}\, dF(\mu)\, dF(\tilde\ell) \big|\big]$

$\qquad \le E\big[\int \big| \int \sinh(X_i'\mu)\, e^{-\|\mu\|^2/2}\, e^{\tilde\ell' X\mu - (n-1)\|\mu\|^2/2}\, dF(\mu) \big|\, dF(\tilde\ell)\big],$

where $E$ denotes the expectation under the law of $X = Z$. Seemingly, to
show (5.1), it suffices to show that for every realization of $\tilde\ell$,

(5.2) $E\big[\big| \int \sinh(X_i'\mu)\, e^{-\|\mu\|^2/2}\, e^{\tilde\ell' X\mu - (n-1)\|\mu\|^2/2}\, dF(\mu) \big|\big] = o(1);$

note that the left hand side does not depend on $i$ and $\tilde\ell$. We now show (5.2)
for the cases of $\beta > (1 - \theta)$ and $\beta < (1 - \theta)$, separately.

Consider the case $\beta < (1 - \theta)$ first. Introduce $V = (n-1)^{-1/2} X'\tilde\ell$; note
that $V \sim N((n-1)^{1/2}\mu, I_p)$. Let $g_-^{(i)}$, $g_+^{(i)}$ and $g_0^{(i)}$ be the joint densities
of $(X_i, V)$ for the cases of $X_i = -\mu + z$, $X_i = \mu + z$, and $X_i = z$, where
$z \sim N(0, I_p)$ and is independent of $\mu$ (in all three cases, $V = (n-1)^{1/2}\mu + \tilde z$
where $\tilde z$ is independent of $(\mu, z)$). By the triangle inequality and symmetry,

$\|g_+^{(i)} - g_-^{(i)}\|_1 \le \|g_+^{(i)} - g_0^{(i)}\|_1 + \|g_-^{(i)} - g_0^{(i)}\|_1 = 2\|g_+^{(i)} - g_0^{(i)}\|_1.$

We recognize that the left hand side of (5.2) is nothing else but
$\|g_+^{(i)} - g_-^{(i)}\|_1$. Combining these, to show (5.2), it is sufficient to show

(5.3) $\|g_+^{(i)} - g_0^{(i)}\|_1 = o(1).$


Now, denote by $A(f, g)$ the Hellinger affinity for any two densities $f$ and $g$.
Denote
$h_p(V(j)) = \epsilon_p e^{\sqrt{n-1}\,\tau_p V(j) - (n-1)\tau_p^2/2} / [1 - \epsilon_p + \epsilon_p e^{\sqrt{n-1}\,\tau_p V(j) - (n-1)\tau_p^2/2}]$.
By definitions and direct calculations, $A(g_+^{(i)}, g_0^{(i)})$ equals

$\prod_{j=1}^p E\{[1 + h_p(V(j))(e^{\tau_p X_i(j) - \tau_p^2/2} - 1)]^{1/2}\} = \big(E\{[1 + h_p(V(1))(e^{\tau_p X_i(1) - \tau_p^2/2} - 1)]^{1/2}\}\big)^p.$

Write for short $u = X_i(1)$ and $w = V(1)$. According to [44, Page 221], for
any probability densities $f$ and $g$, $\|f - g\|_1 \le 2\sqrt{2 - 2A(f, g)}$. Combining
this with the expression of $A(g_+^{(i)}, g_0^{(i)})$, to show (5.3), it suffices to show

(5.4) $E[(1 + h_p(w)[e^{\tau_p u - \tau_p^2/2} - 1])^{1/2}] = 1 + o(p^{-1}).$

Note that for any $x > -1$, $|\sqrt{1 + x} - 1 - x/2| \le Cx^2$. Hence,

$\big|E[(1 + h_p(w)[e^{\tau_p u - \tau_p^2/2} - 1])^{1/2}] - E[1 + \tfrac12 h_p(w)(e^{\tau_p u - \tau_p^2/2} - 1)]\big|$

(5.5) $\qquad \le C E[h_p^2(w)(e^{\tau_p u - \tau_p^2/2} - 1)^2].$

On one hand, due to the independence between $w$ and $u$ and the fact that
$E[e^{\tau_p u - \tau_p^2/2}] = 1$, we have $E[h_p(w)[e^{\tau_p u - \tau_p^2/2} - 1]] = 0$ and
$E[h_p^2(w)(e^{\tau_p u - \tau_p^2/2} - 1)^2] = E[h_p^2(w)]\,E[(e^{\tau_p u - \tau_p^2/2} - 1)^2]$.
On the other hand, since
$h_p(w) \le \epsilon_p e^{\sqrt{n-1}\,\tau_p w - (n-1)\tau_p^2/2}$, by direct calculations there is
$E[h_p^2(w)] \le \epsilon_p^2 e^{(n-1)\tau_p^2}$,
and $E[(e^{\tau_p u - \tau_p^2/2} - 1)^2] = e^{\tau_p^2} - 1$. Inserting these into (5.5) and invoking
$\epsilon_p = p^{-\beta}$, $\tau_p = p^{-\alpha}$, and $n = p^\theta$,

$\big|E[(1 + h_p(w)[e^{\tau_p u - \tau_p^2/2} - 1])^{1/2}] - 1\big| \le C\epsilon_p^2(e^{\tau_p^2} - 1)e^{(n-1)\tau_p^2} \le Cp^{-2\beta - 2\alpha} e^{p^{\theta - 2\alpha}}.$

By the assumptions of $\alpha > \eta_\theta^{clu}(\beta)$ and $\beta < (1 - \theta)$, we have $2(\beta + \alpha) > 1$
and $\theta < 2\alpha$, and (5.4) follows.

We now consider the case of f3 > (1 — 9). In this case, similarly, by basic 
algebra and Fubini’s theorem, the left hand side of (5.2) is no greater than 

E[ j |sinh(A'^)|e- | ^ l| 2 / 2 e^- (?l - 1 ) ^ l| 2 / 2 dF(/i)] 

= J F;[|sinh(A'^)|e" ll/i|| 2 / 2 e^ Y/i " (n ” 1 )ll,i|| 2 / 2 ]dF(/r) 

(5.6) = J E\sm\i(X'p)\e- ll ^ 2/2 dF(g), 

where in the last step we have used the independence between Xi and { X^ : 
k 7 ^ *, 1 < k < n}, and that E[e^ xp ~^ n ~ 1 ^ H^H 2 / 2 ] = 1 . Finally, let A p be the 
event of $\{\mu : \|\mu\|_0/(p\epsilon_p) \le 2\}$, and write

(5.7) $\int E|\sinh(X_i'\mu)| e^{-\|\mu\|^2/2} dF(\mu) = I + II$,

where $I = \int (E|\sinh(X_i'\mu)| e^{-\|\mu\|^2/2} \cdot 1_{A_p}) dF(\mu)$, and $II = \int (E|\sinh(X_i'\mu)| e^{-\|\mu\|^2/2} \cdot 1_{A_p^c}) dF(\mu)$.

By Cauchy-Schwarz inequality, $(E|\sinh(X_i'\mu)|)^2 \le E[(\sinh(X_i'\mu))^2] = (e^{2\|\mu\|^2} - 1)/2$ for any realized $\mu$ in $A_p$. Combining this with basic algebra, it follows that $I \le \int (\sqrt{\sinh(\|\mu\|^2)} \cdot 1_{A_p}) dF(\mu) \le \sqrt{\sinh(2p\epsilon_p\tau_p^2)}$, where in the last step, we have used the fact that over the event $A_p$, $\|\mu\|^2 \le 2p\epsilon_p\tau_p^2$.

By our assumption of $\tau_p = p^{-\alpha}$, $\epsilon_p = p^{-\beta}$ and $\alpha > \eta_\theta^{clu}(\beta) = (1 - \beta)/2$, $p\epsilon_p\tau_p^2 = o(1)$. Combining these gives

(5.8) $I \le o(1)$.

At the same time, since $|\sinh(x)| \le \cosh(x)$ for any $x$,

(5.9) $II \le \int (E\cosh(X_i'\mu) e^{-\|\mu\|^2/2} \cdot 1_{A_p^c}) dF(\mu) = P(A_p^c)$;

note that $P(A_p^c) = o(1)$. We insert (5.8)-(5.9) into (5.7), and find that $\int E|\sinh(X_i'\mu)| e^{-\|\mu\|^2/2} dF(\mu) = o(1)$. Then (5.2) follows from (5.6). □

5.2. Proof of Theorem 1.5. Recall that $Z$ has iid entries from $N(0,1)$. By elementary statistics$^{17}$ and conditions on $A$ and $B$, there is a non-stochastic term $c_p$ such that (a) $c_p^{-1} \le L_p$, (b) there is a random matrix $W \in R^{n,p}$ such that $c_pZ + W$ has the same distribution as $AZB$ ($W$ is independent of $(\ell, \mu, Z)$). Compare two experiments

Experiment 1. $X = \ell\mu' + c_pZ$, Experiment 2. $X = \ell\mu' + c_pZ + W$.

Fixing $1 \le i \le n$, consider the testing of two hypotheses, $H_-^{(i)}: \ell_i = -1$ versus $H_+^{(i)}: \ell_i = 1$. Let $f_\pm^{(i)}$ be the joint density of $X$ under $H_\pm^{(i)}$, respectively, for Experiment 1, and let $g_\pm^{(i)}$ be the joint density of $X$ under $H_\pm^{(i)}$, respectively, for Experiment 2. By Neyman-Pearson's fundamental lemma on testing [44], for any clustering procedure $\hat\ell$, tight lower bounds for $P(\hat\ell_i \ne \ell_i)$ (expected Hamming error at location $i$) associated with the two experiments are $1 - \|f_+^{(i)} - f_-^{(i)}\|_1$ and $1 - \|g_+^{(i)} - g_-^{(i)}\|_1$, respectively, where $\|f - g\|_1$ denotes the $L^1$-distance between two densities $f$ and $g$. Le Cam's idea can be solidified as follows:

Theorem 5.1 (Monotonicity of $L^1$-distance). $\|g_+^{(i)} - g_-^{(i)}\|_1 \le \|f_+^{(i)} - f_-^{(i)}\|_1$.

$^{17}$ Note that $ZB$ has the same distribution as $c_pZ + W$, where $Z$ has iid normal entries, $c_p$ is half of the minimum eigenvalue of $BB'$, and the columns of $W$ follow $N(0, BB' - c_pI_p)$ distribution. Similar analysis for $A(c_pZ + W)$ gives the result.

Using this, Theorem 1.5 follows directly from the proof of Theorem 1.1.

It remains to show Theorem 5.1. Without loss of generality, we assume $i = 1$, and drop the superscripts in $g_\pm$ and $f_\pm$ for simplicity. Let $a \in R^{n-1}$ be the vector such that $a_k \stackrel{iid}{\sim} 2\mathrm{Bernoulli}(1/2) - 1$. For any realization of $a$, let $\ell_\pm = \ell_\pm(a) \in R^n$ be the vectors of $(\pm 1, a')'$, respectively. Let $F(a)$, $F(\mu)$, and $F(w)$ be the CDF of $a$, $\mu$, and $W$ respectively, and let $h(z)$ be the (joint) density of the matrix $Z$. It follows that $g_\pm(x) = \int h(x - \ell_\pm(a)\mu' - w) dF(a)dF(\mu)dF(w)$, $x \in R^{n,p}$, and $\|g_+ - g_-\|_1$ equals

$\int |\int [h(x - \ell_+(a)\mu' - w) - h(x - \ell_-(a)\mu' - w)] dF(a)dF(\mu)dF(w)| dx$.

Using Fubini's theorem, this is no greater than $\int G(w)dF(w)$, where $G(w) = \int |\int [h(x - \ell_+(a)\mu' - w) - h(x - \ell_-(a)\mu' - w)] dF(a)dF(\mu)| dx$. Note that for any fixed $w \in R^{n,p}$, $G(w)$ does not depend on $w$ and equals $\|f_+ - f_-\|_1$, and the claim follows. □


5.3. Proof of Theorem 3.1. For each $1 \le j \le p$, consider the testing of two hypotheses, $H_0^{(j)}: \mu(j) = 0$ versus $H_1^{(j)}: \mu(j) = \tau_p$. Let $f_0^{(j)}$ and $f_1^{(j)}$ be the joint density of $X$ under $H_0^{(j)}$ and $H_1^{(j)}$, respectively. Since $P(\mu(j) = \tau_p) = \epsilon_p$, it follows from the connection between $L^1$-distance and the sum of Type I and Type II testing errors [44] that for any signal recovery procedure $\hat\mu$,

$P(\mathrm{sgn}(\hat\mu(j)) \ne \mathrm{sgn}(\mu(j))) = (1 - \epsilon_p)P(\hat\mu(j) \ne 0 \mid \mu(j) = 0) + \epsilon_p P(\hat\mu(j) = 0 \mid \mu(j) = \tau_p) \ge \epsilon_p[1 - (1/2)\|f_0^{(j)} - f_1^{(j)}\|_1]$,

where in the last step we have used $\|(1 - \epsilon_p)f_0^{(j)} - \epsilon_p f_1^{(j)}\|_1 = \|(1 - 2\epsilon_p)f_0^{(j)} + \epsilon_p(f_0^{(j)} - f_1^{(j)})\|_1 \le (1 - 2\epsilon_p) + \epsilon_p\|f_0^{(j)} - f_1^{(j)}\|_1$. Comparing this with the desired claim, it suffices to show that for all $1 \le j \le p$,

(5.10) $\|f_0^{(j)} - f_1^{(j)}\|_1 = o(1)$, where $o(1) \to 0$ and does not depend on $j$.

We now show (5.10) for every fixed $1 \le j \le p$. We first consider the case $\beta < 1 - \theta$. For short, we drop the superscript "(j)" in $f_0^{(j)}$ and $f_1^{(j)}$. Recall that $X = \ell\mu' + Z = [x_1, x_2, \ldots, x_p]$ and let $\tilde\mu = \mu - \mu(j)e_j$, where $e_j$ is the $j$-th standard basis vector of $R^p$; note that $\tilde\mu(j) = 0$. Let $E$ denote the expectation under the law of $X = Z$. By basic calculus and Fubini's theorem,

$\|f_0 - f_1\|_1 = E[|\int [1 - e^{\tau_p\langle \ell, x_j\rangle - n\tau_p^2/2}] e^{\ell'X\tilde\mu - n\|\tilde\mu\|^2/2} dF(\tilde\mu)dF(\ell)|]$

$\le \int E[|1 - e^{\tau_p\langle \ell, x_j\rangle - n\tau_p^2/2}| e^{\ell'X\tilde\mu - n\|\tilde\mu\|^2/2}] dF(\tilde\mu)dF(\ell)$

(5.11) $= \int E[|1 - e^{\tau_p\langle \ell, x_j\rangle - n\tau_p^2/2}|] dF(\ell)$,

where in the last step, we have used the fact that $x_j$ and $X\tilde\mu$ are independent and that $E[e^{\ell'X\tilde\mu - n\|\tilde\mu\|^2/2}] = 1$. Additionally, note that $E[|1 - e^{\tau_p\langle \ell, x_j\rangle - n\tau_p^2/2}|]$ does not depend on $\ell$. Denote $z = n^{-1/2}\langle \ell, x_j\rangle$; note that $z \sim N(0,1)$. Inserting these into (5.11) gives

(5.12) $\|f_0 - f_1\|_1 \le E_0[|1 - e^{\sqrt{n}\tau_p z - n\tau_p^2/2}|]$,

where $E_0$ denotes the expectation under the law of $z \sim N(0,1)$. By the conditions of $\alpha > \eta_\theta^{sig}(\beta)$ and $\beta < (1 - \theta)$, we have $\alpha > \theta/2$, and $n\tau_p^2 = p^{\theta - 2\alpha} = o(1)$. In this simple setting, it is seen that $E_0[|1 - e^{\sqrt{n}\tau_p z - n\tau_p^2/2}|] = o(1)$. Combining (5.10)-(5.12) gives the claim.
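The last expectation can be made explicit. The quantity $E_0|1 - e^{sz - s^2/2}|$ with $s = \sqrt{n}\tau_p$ is the $L^1$-distance between $N(0,1)$ and $N(s,1)$, which equals $2\,\mathrm{erf}(s/(2\sqrt{2}))$. The sketch below is an illustration only: the function names and the values $\theta = 0.4$, $\alpha = 0.3$ are assumptions chosen so that $\alpha > \theta/2$. It compares the closed form with a plain trapezoid rule and shows the decay as $p$ grows.

```python
import math

def l1_gap(s):
    """Closed form of E|1 - exp(s*z - s^2/2)| for z ~ N(0,1)."""
    return 2.0 * math.erf(s / (2.0 * math.sqrt(2.0)))

def l1_gap_numeric(s, steps=20000, lo=-12.0, hi=12.0):
    """Trapezoid-rule approximation of the same expectation."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        z = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        dens = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
        total += weight * abs(1.0 - math.exp(s * z - 0.5 * s * s)) * dens
    return total * h

theta, alpha = 0.4, 0.3  # illustrative exponents with alpha > theta/2
for p in (1e3, 1e6, 1e9, 1e12):
    s = p ** ((theta - 2.0 * alpha) / 2.0)  # s = sqrt(n) * tau_p
    print(p, s, l1_gap(s))
```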

We now consider the case $\beta > (1 - \theta)$. In this case, $\eta_\theta^{sig}(\beta) = \eta_\theta^{hyp}(\beta)$, so intuitively, the claim follows by the argument that "as long as it is impossible to have (global) hypothesis testing, it is impossible to identify the signals". Still, for mathematical rigor, it is desirable to provide a proof using the $L^1$-distance. Similarly to that in the proof on the lower bound for global testing, write $\mu = \tilde\mu + \mu(j)e_j$ and let $d_p = (6p\epsilon_p\log(p))^{1/2}$, $A_s$ be the event $\{\|\tilde\mu\|_0 = s\}$ and $F_s$ be the conditional distribution of $\tilde\mu$ given the event of $A_s$, $1 \le s \le p$. Define $a_s = \int e^{\tau_p\langle \ell, x_j\rangle - n\tau_p^2/2} e^{\ell'X\tilde\mu - n\|\tilde\mu\|^2/2} dF_s(\tilde\mu)dF(\ell)$ and $\tilde a_s = \int e^{\ell'X\tilde\mu - n\|\tilde\mu\|^2/2} dF_s(\tilde\mu)dF(\ell)$. It suffices to show that for all $s$ such that $|s - p\epsilon_p| \le d_p$,

(5.13) $E[(a_s - \tilde a_s)^2] = o(1)$.

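Let $\tilde\nu$ be an independent duplicate of $\tilde\mu$, and write $\nu = \tilde\nu + \tau_p e_j$. By similar arguments and noting that $\mu'\nu = \tilde\mu'\tilde\nu + \tau_p^2$ and $\mu'\tilde\nu = \tilde\mu'\tilde\nu$, we have $E[a_s^2] = \int [\cosh(\tilde\mu'\tilde\nu + \tau_p^2)]^n dF_s(\tilde\mu)dF_s(\tilde\nu)$, $E[\tilde a_s^2] = \int [\cosh(\tilde\mu'\tilde\nu)]^n dF_s(\tilde\mu)dF_s(\tilde\nu)$, and the cross term $E[a_s\tilde a_s] = \int [\cosh(\tilde\mu'\tilde\nu)]^n dF_s(\tilde\mu)dF_s(\tilde\nu)$. Combining these terms and noting that $\cosh(x + y) = \cosh(x)\cosh(y)[1 + \tanh(x)\tanh(y)]$, there is

$E[(a_s - \tilde a_s)^2] = \int [\cosh(\tilde\mu'\tilde\nu)]^n \{[\cosh(\tau_p^2)]^n [1 + \tanh(\tau_p^2)\tanh(\tilde\mu'\tilde\nu)]^n - 1\} dF_s(\tilde\mu)dF_s(\tilde\nu)$.

Now, over the event $\{(\tilde\mu, \tilde\nu) : \|\tilde\mu\|_0 = \|\tilde\nu\|_0 = s\}$, where $s \sim p\epsilon_p$, we have $|\tilde\mu'\tilde\nu| \le s\tau_p^2 \lesssim p\epsilon_p\tau_p^2 < p^{(1 - \beta - \theta)/2}$; note that by the assumption of $\alpha > \eta_\theta^{hyp}(\beta)$ and $\beta > (1 - \theta)$, the exponent $(1 - \beta - \theta)/2 < 0$. As a result, it is seen that $\tanh(\tilde\mu'\tilde\nu)\tanh(\tau_p^2) \le \tilde\mu'\tilde\nu\,\tau_p^2 \lesssim p\epsilon_p\tau_p^4$, where $p\epsilon_p\tau_p^4 = o(n^{-1})$ by the assumption of $\alpha > \eta_\theta^{sig}(\beta)$. Since $p\epsilon_p \to \infty$, this also gives $n\tau_p^4 = o(1)$, so $[\cosh(\tau_p^2)]^n \le e^{n\tau_p^4/2} = 1 + o(1)$. Inserting these into (5.13) gives

(5.14) $E[(a_s - \tilde a_s)^2] = o(1) \cdot \int [\cosh(\tilde\mu'\tilde\nu)]^n dF_s(\tilde\mu)dF_s(\tilde\nu)$.

According to (5.17)-(5.18) in Section 5.4, the second term on the right hand side of (5.14) is $1 + o(1)$. This gives the claim. □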
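The expansion of $\cosh(\tilde\mu'\tilde\nu + \tau_p^2)$ rests on the hyperbolic addition formula $\cosh(x+y) = \cosh(x)\cosh(y)[1 + \tanh(x)\tanh(y)]$, together with the elementary bound $\cosh(y) \le e^{y^2/2}$ that controls the factor $[\cosh(\tau_p^2)]^n$. The short sketch below is a numerical sanity check only, with hypothetical function names.

```python
import math

def cosh_of_sum(x, y):
    """cosh(x + y) written through the addition formula."""
    return math.cosh(x) * math.cosh(y) * (1.0 + math.tanh(x) * math.tanh(y))

def cosh_power_bound(y, n):
    """Upper bound exp(n*y^2/2) for cosh(y)**n."""
    return math.exp(n * y * y / 2.0)

for x, y in ((0.3, 0.01), (-0.2, 0.05), (1.5, 0.7)):
    print(math.cosh(x + y), cosh_of_sum(x, y))
```

With $y = \tau_p^2$, the bound reads $[\cosh(\tau_p^2)]^n \le e^{n\tau_p^4/2}$, which tends to 1 whenever $n\tau_p^4 = o(1)$.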

5.4. Proof of Theorem 4.1. Recall that $X = \ell\mu' + Z$. Let $f_0(X)$ and $f_1(X)$ be the joint density of $Z$ and $X$, respectively. It is sufficient to show that as $p \to \infty$, under the conditions of Theorem 4.1,

(5.15) $\|f_1 - f_0\|_1 \to 0$.

Recall that $\|\mu\|_0$ and $\|\mu\|$ denote the $L^0$-norm and the $L^2$-norm of $\mu$ respectively. For $1 \le s \le p$, let $A_s$ be the event $A_s = \{\|\mu\|_0 = s\}$, $F(\ell)$ and $F(\mu)$ be the distributions of $\ell$ and $\mu$, respectively, and let $F_s(\mu)$ be the conditional distribution of $\mu$ given the event of $A_s$. Introduce a constant $d_p = (6p\epsilon_p\log(p))^{1/2}$, a set $D_p = \{s : |s - p\epsilon_p| \le d_p\}$, and functions $a_s(X) = \int e^{\ell'X\mu - n\|\mu\|^2/2} dF_s(\mu)dF(\ell)$, $1 \le s \le p$. Let $E$ be the expectation under the law of $X = Z$. It is seen that $f_1(X)/f_0(X) = \int e^{\ell'X\mu - n\|\mu\|^2/2} dF(\mu)dF(\ell) = \sum_{s=1}^p P(A_s)a_s(X)$, and so $\|f_1 - f_0\|_1$ equals

(5.16) $E|\sum_{s=1}^p P(A_s)(a_s(X) - 1)| \le \sum_{s \in D_p} P(A_s)E[|a_s(X) - 1|] + \mathrm{rem}$,

where $\mathrm{rem} = \sum_{s \in D_p^c} P(A_s)E[|a_s(X) - 1|]$. Since $E[|a_s - 1|] \le E[a_s] + 1 = 2$, $\mathrm{rem} \le 2\sum_{s \in D_p^c} P(A_s) \le 2P(\|\mu\|_0 \in D_p^c)$. Note that $\|\mu\|_0 \sim \mathrm{Binomial}(p, \epsilon_p)$, where $p\epsilon_p = p^{1-\beta}$ with $0 < \beta < 1$; it follows from basic statistics that $\mathrm{rem} = o(1)$. At the same time, by Cauchy-Schwarz inequality, $(E[|a_s(X) - 1|])^2 \le E[(a_s(X) - 1)^2] = E[a_s^2(X)] - 1$. Combining these with (5.16), to show (5.15), it suffices to show that

(5.17) $E[a_s^2(X)] \le 1 + o(1)$, $\forall s \in D_p$,

where $o(1) \to 0$ uniformly for all such $s$ as $p \to \infty$.
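The claim $\mathrm{rem} = o(1)$ only needs the binomial count $\|\mu\|_0$ to stay inside the window $D_p$ with high probability. The sketch below computes this tail exactly for one parameter choice; it is an illustration, and the function names and the values $p = 10000$, $\epsilon_p = 0.01$ are assumptions, not values from the paper.

```python
import math

def log_binom_pmf(k, p, eps):
    """log P(Binomial(p, eps) = k), computed through log-gamma."""
    return (math.lgamma(p + 1) - math.lgamma(k + 1) - math.lgamma(p - k + 1)
            + k * math.log(eps) + (p - k) * math.log1p(-eps))

def prob_outside_window(p, eps):
    """P(| ||mu||_0 - p*eps | > d_p) with d_p = sqrt(6 * p*eps * log(p))."""
    mean = p * eps
    d_p = math.sqrt(6.0 * mean * math.log(p))
    return sum(math.exp(log_binom_pmf(k, p, eps))
               for k in range(p + 1) if abs(k - mean) > d_p)

print(prob_outside_window(10000, 0.01))
```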

We now show (5.17). Fix an $s \in D_p$. Let $\nu \in R^p$ be an independent copy of $\mu$, and let $F_s(\nu)$ be the distribution of $(\nu \mid \{\|\nu\|_0 = s\})$. Using basic statistics and the independence of $X_i$,

$a_s^2(X) = \int e^{-n\|\mu\|^2/2 - n\|\nu\|^2/2} \prod_{i=1}^n [\cosh(\mu'X_i)\cosh(\nu'X_i)] dF_s(\mu)dF_s(\nu)$.

First, by the independence of $X_i$ and basic statistics, $E[a_s^2(X)]$ equals

(5.18) $\int [\cosh(\mu'\nu)]^n dF_s(\mu)dF_s(\nu) = 2^{-n}\sum_{k=0}^n \binom{n}{k} \int e^{(2k-n)\mu'\nu} dF_s(\mu)dF_s(\nu)$.

Recalling that any nonzero entry of $\mu$ or $\nu$ is $\tau_p$, it is seen that over the event $\{\|\mu\|_0 = \|\nu\|_0 = s\}$, $\tau_p^{-2}\langle \mu, \nu\rangle$ is distributed as a hyper-geometric distribution $H(p,s,s)$. Write $\tilde\epsilon_p = s/p$. As $s \in D_p$, $\tilde\epsilon_p \sim \epsilon_p$. Following [2], there is a $\sigma$-algebra $\mathcal{B}$ and a random variable $b \sim \mathrm{Binomial}(s, \tilde\epsilon_p)$ such that $\tau_p^{-2}\langle \mu, \nu\rangle$ has the same distribution as that of $E[b \mid \mathcal{B}]$. Using Jensen's inequality, $e^{(2k-n)\mu'\nu} \le E[e^{(2k-n)\tau_p^2 b} \mid \mathcal{B}]$, for $0 \le k \le n$. It follows that

(5.19) $\int e^{(2k-n)\mu'\nu} dF_s(\mu)dF_s(\nu) \le E[e^{(2k-n)\tau_p^2 b}] = (1 - \tilde\epsilon_p + \tilde\epsilon_p e^{(2k-n)\tau_p^2})^s$.

Inserting (5.19) into (5.18) and rearranging,

(5.20) $E[a_s^2(X)] \le 2^{-n}\sum_{k=0}^n \binom{n}{k} [1 - \tilde\epsilon_p + \tilde\epsilon_p e^{(2k-n)\tau_p^2}]^s$.
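We now analyze the right hand side of (5.20). Denote $S$ by $\{1, 2, \ldots, n\}$. We split $S$ as the union of three disjoint subsets $S = S_1 \cup S_2 \cup S_3$, where $S_1 = \{k \in S : |2k - n| \le \sqrt{n}\log(n)\}$, $S_3 = \{k \in S : |2k - n| > n \wedge \sqrt{2\log(n)np\epsilon_p}\}$, and $S_2 = S \setminus (S_1 \cup S_3)$.

Also, let $\tilde\tau_p = p^{-\eta_\theta^{hyp}(\beta)}$. By our assumption of $\alpha > \eta_\theta^{hyp}(\beta)$, there is a constant $\delta = \delta(\theta, \alpha) > 0$ such that $\tau_p^2 = p^{-\delta}\tilde\tau_p^2$. We also claim that when $\alpha > \eta_\theta^{hyp}(\beta)$, $\tau_p^2|2k - n| = o(1)$ for any $k \in S_1 \cup S_2$. In fact, by definitions and direct calculations, we have $\eta_\theta^{hyp}(\beta) \ge \theta/2$ when $\beta < \max\{1 - \theta, (2 - \theta)/4\}$ and $\eta_\theta^{hyp}(\beta) = (1 + \theta - \beta)/4$ otherwise. In the first case, recalling $n = p^{\theta}$, the claim follows since $\tau_p^2|2k - n| \le \tau_p^2 n = p^{\theta - 2\alpha}$ and $\alpha > \theta/2$. In the second case, noting that $\tau_p^2 = p^{-\delta}\tilde\tau_p^2 = p^{-\delta}(np\epsilon_p)^{-1/2}$, it follows $|2k - n|\tau_p^2 \le \sqrt{2\log(n)np\epsilon_p} \cdot (p^{-\delta}(np\epsilon_p)^{-1/2}) = o(1)$ for all $k \in S_1 \cup S_2$, and the claim follows. Now, since for any $x \in (-1,1)$ and $y \in R$, $1 - \tilde\epsilon_p + \tilde\epsilon_p e^x \le 1 + 2\tilde\epsilon_p|x| \le e^{2\tilde\epsilon_p|x|}$, and $1 - \tilde\epsilon_p + \tilde\epsilon_p e^y \le 1 - \tilde\epsilon_p + \tilde\epsilon_p e^{|y|} \le e^{|y|}$,

(5.21) $(1 - \tilde\epsilon_p + \tilde\epsilon_p e^{\tau_p^2(2k-n)})^s \le \begin{cases} [1 + 2\tilde\epsilon_p\tau_p^2|2k - n|]^s \le e^{2p\tilde\epsilon_p^2\tau_p^2|2k-n|}, & k \in S_1 \cup S_2, \\ (e^{\tau_p^2|2k-n|})^s = e^{s\tau_p^2|2k-n|}, & k \in S_3. \end{cases}$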
If we take $Y \sim \mathrm{Binomial}(n, 1/2)$, then $P(Y = k) = 2^{-n}\binom{n}{k}$. At the same time, by de Moivre-Laplace Theorem and Hoeffding inequality [41],

(5.22) $P(Y = k) \begin{cases} \sim (\pi n/2)^{-1/2} e^{-(2k-n)^2/(2n)}, & k \in S_1, \\ \le e^{-(2k-n)^2/(2n)}, & k \in S_2 \cup S_3. \end{cases}$

Combining (5.21)-(5.22), we have the following. First, the summation over $k \in S_1$ is smaller than ($\phi$ is the probability density of $N(0,1)$)

(5.23) $(2/\sqrt{n})\sum_{k \in S_1} e^{2p\tilde\epsilon_p^2\tau_p^2|2k-n|}\phi((2k-n)/\sqrt{n}) \sim \int_{-\log(n)}^{\log(n)} e^{2p\tilde\epsilon_p^2\tau_p^2\sqrt{n}|x|}\phi(x)dx$.

By the assumption of $\alpha > \eta_\theta^{hyp}(\beta)$ and basic algebra, we have $\alpha > (2 + \theta - 4\beta)/4$. It follows that $2p\tilde\epsilon_p^2\tau_p^2\sqrt{n} \sim 2p^{-\delta}p\epsilon_p^2\tilde\tau_p^2\sqrt{n} = 2 \cdot p^{-\delta} \cdot p^{(2+\theta-4\beta)/2 - 2\eta_\theta^{hyp}(\beta)}$, where the exponent is negative. It follows that the right hand side of (5.23) is $1 + o(1)$. Second, let $II$ be the summation over $k \in S_2$, then

(5.24) $II \le \sum_{k \in S_2} e^{2p\tilde\epsilon_p^2\tau_p^2|2k-n|} e^{-(2k-n)^2/(2n)} \le \sum_{k \in S_2} e^{-(2k-n)^2/(4n)}$,

where the second inequality is because $2p\tilde\epsilon_p^2\tau_p^2 \le 2p^{-\delta}/\sqrt{n} \le |2k - n|/(4n)$. The right hand side does not exceed $ne^{-\log^2(n)/4} = o(1)$ since $|2k - n| > \sqrt{n}\log(n)$. Last, we consider the summation over $k \in S_3$. We only consider the case of $\beta > (1 - \theta)$ since only in this case $S_3$ is non-empty. Note that in this case, $n \wedge \sqrt{2\log(n)np\epsilon_p} = \sqrt{2\log(n)np\epsilon_p}$ and that for any $k \in S_3$, $s\tau_p^2 \lesssim p^{-\delta}p\epsilon_p\tilde\tau_p^2 \le |2k - n|/(4n)$,

(5.25) $III \le \sum_{k \in S_3} e^{s\tau_p^2|2k-n| - (2k-n)^2/(2n)} \le \sum_{k \in S_3} e^{-(2k-n)^2/(4n)}$,

which is $\le ne^{-\log(n)p\epsilon_p/2} = o(1)$. Combining (5.23)-(5.25) with (5.20) gives the claim. □


6. Discussions. We have studied the statistical limits for three interconnected problems: clustering, signal recovery, and hypothesis testing. For each problem, in the two-dimensional phase space calibrating the signal sparsity and strength, we identify the exact separating boundary for the Region of Possibility and Region of Impossibility. We have also derived a computationally tractable upper bound (CTUB), part of which is tight, and the other part is conjectured to be tight. Our study on the limits is extended to the case where the parameters fall exactly on the separating boundaries and the case of colored noise.

We propose several different methods, including IF-PCA. IF-PCA is a two-fold dimension reduction algorithm: we first reduce dimensions from (say) $10^4$ to a few hundred by screening, and then further reduce it to just a few by PCA. Each of the two steps can be useful in other high-dimensional settings. Compared to popular penalization approaches, our approach has the advantages of being highly extendable and computationally inexpensive.
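The two steps (screening, then PCA) can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the authors' implementation: the data follow $X = \ell\mu' + Z$ with the first $s$ features useful, the screening threshold is $t^2 = n + 2\sqrt{qn\log(p)}$ with a hand-picked $q$, and the leading left singular vector is found by plain power iteration. All function names and parameter values are hypothetical.

```python
import math
import random

def simulate(n, p, s, tau, rng):
    """Draw X = l mu' + Z, where the first s features carry signal tau."""
    labels = [rng.choice((-1, 1)) for _ in range(n)]
    X = [[labels[i] * (tau if j < s else 0.0) + rng.gauss(0.0, 1.0)
          for j in range(p)] for i in range(n)]
    return X, labels

def screen(X, t):
    """Step 1: keep the columns whose l2-norm exceeds the threshold t."""
    n, p = len(X), len(X[0])
    return [j for j in range(p)
            if math.sqrt(sum(X[i][j] ** 2 for i in range(n))) > t]

def leading_left_singular_vector(X, cols, rng, iters=100):
    """Step 2: power iteration on X_S X_S', with X_S the kept columns."""
    if not cols:
        raise ValueError("no feature survived the screening step")
    n = len(X)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        w = [sum(X[i][j] * v[i] for i in range(n)) for j in cols]  # X_S' v
        u = [sum(X[i][j] * w[c] for c, j in enumerate(cols))       # X_S w
             for i in range(n)]
        norm = math.sqrt(sum(a * a for a in u))
        v = [a / norm for a in u]
    return v

def if_pca(X, t, rng):
    """Cluster by the signs of the post-selection leading singular vector."""
    cols = screen(X, t)
    v = leading_left_singular_vector(X, cols, rng)
    return [1 if a >= 0 else -1 for a in v], cols

def clustering_error(est, truth):
    """Fraction of mislabeled subjects, up to a global sign flip."""
    wrong = sum(a != b for a, b in zip(est, truth))
    return min(wrong, len(truth) - wrong) / len(truth)

rng = random.Random(0)
n, p, s, tau, q = 40, 300, 30, 1.5, 3.0
X, labels = simulate(n, p, s, tau, rng)
t = math.sqrt(n + 2.0 * math.sqrt(q * n * math.log(p)))
est, cols = if_pca(X, t, rng)
print(len(cols), clustering_error(est, labels))
```

The signal here is deliberately strong so that the example is stable; in the rare and weak regime the choice of threshold matters much more, which is the subject of the phase transition results.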

The work is closely related to Jin and Wang [31] but is also very different. The focus of [31] is to investigate the performance of IF-PCA with real data examples and to study the consistency theory. The primary focus here, however, is on the statistical limits for three problems including clustering. The paper is also closely related to the very interesting paper by Arias-Castro and Verzelen [5]. However, the two papers are different in important ways.

• The focus of our paper is on clustering, while the focus of their paper is on hypothesis testing (without careful discussion on clustering).

• Both papers addressed signal recovery, but there are important differences: we provided the statistical lower bound but they did not; the CTUB they derived is not as sharp as ours. See Figure 2.

• Both papers studied hypothesis testing, but since the models are different, the separating boundaries (and so the proofs) are also different. See Sections 1.6 and 4 (also Figure 2) for details.

• Both papers studied the case with colored noise. Besides the different focuses (clustering vs. hypothesis testing), their setting in the colored case is also different from ours. In their setting, coloration makes a substantial difference to statistical limits.

For these reasons, the methods and theory (especially that on IF-PCA) in our paper are very different from those in [5]. With that being said, we must note that since the two papers have overlapping interests, it is not surprising that certain parts of this paper overlap$^{18}$ with [5] (e.g., some parts of the separating boundaries and some of the ideas and methods).

The paper is related to recent ideas in spectral clustering (e.g., Azizyan et al [7], Chan and Hall [13]; see also [38, 39, 43, 49]). In particular, the high level idea of IF-PCA (i.e., combining feature selection with classical methods) is not new and can be found in [7, 13], but the methods and theory are different. Azizyan et al [7] study the clustering problem in a closely related setting, but they use a different loss function and so the separating boundaries are also different. Chan and Hall [13] use a very different screening idea (motivated by real data analysis) and do not study phase transitions.

$^{18}$ Comparing the critical signal strength required for successful hypothesis testing/signal recovery in our paper with those in [5], we note some discrepancies in terms of some multi-logarithmic factors. This is because we choose a simpler calibration than that in [5]: all the parameters $(n, \epsilon, \tau)$ are expressed as a (constant) power of $p$ and multi-logarithmic factors are neglected. Such a calibration makes the presentation more succinct.

Our work is closely related to recent interest in the spike model (e.g., [4, 36, 47]). In particular, mathematically, Model (1.2) is similar to the spike model [32], and theoretical results on one can shed light on those for the other. However, the two models are also different from a scientific perspective: (a) the two models are motivated by different application problems, (b) the primary interest of Model (1.2) is on the class labels $\ell_i$, which are sometimes easy to validate in real applications, and (c) the primary interest of the spike model is on the feature vector $\mu$, which is relatively hard to validate in real applications. The focus and scope of our study are very different from many recent works on the spike model, and most of the bounds (especially those for clustering and IF-PCA) we derive are new.

This paper is also related to the recent interest in computationally tractable lower bounds and sparse PCA [9, 11], but it is also very different in terms of our focus on clustering and statistical limits. It is also related to the lower bound for the hypothesis testing problem [1] and the sub-matrix detection problem [37], but the model is different. Recovery of $\ell$ and $\mu$ can also be interpreted as recovering a low-rank matrix from the data matrix, which is closely related to the low rank matrix recovery studies [12]. In terms of the phase transitions, the paper is closely related to [19] on signal detection, [20] on classification, and [33] on variable selection, but is also very different for the primary focus here is on clustering.

For simplicity, we focus on the ARW model, where we have several assumptions such as $\ell_i = \pm 1$ equally likely, the signals have the same sign and equal strength, etc. Many of these assumptions can be largely relaxed. For example, Theorems 1.1-1.3 continue to hold if we replace the model $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p\nu_{\tau_p}$ by that of $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_pG_p$, where $G_p$ is a distribution supported in the interval $[a_p\tau_p, b_p\tau_p]$ with $0 < \max\{a_p^{-1}, b_p\} \le L_p$ (a multi-$\log(p)$ term). Also, in Section 1.6, we have discussed the case where we replace $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + \epsilon_p\nu_{\tau_p}$ in Model (1.7) by that of $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon_p)\nu_0 + a\epsilon_p\nu_{-\tau_p} + (1 - a)\epsilon_p\nu_{\tau_p}$ for a constant $0 \le a \le 1/2$. Theorems 1.1-1.3 continue to hold if $a \ne 1/2$. If $a = 1/2$, the left part of the boundaries will change and the aggregation methods need to be modified. We discuss this case in detail in Section D. It requires a lot of time and effort to fully investigate how broadly the main theorems hold, so we leave it to the future.

The paper motivates an array of interesting problems in post-selection Random Matrix Theory that could be future research topics. From the perspective of spectral clustering, it is of great interest to precisely characterize the limiting behavior of the singular values (bulk and the edge singular values) and leading singular vectors of the post-selection data matrix. These problems are technically very challenging, and we leave them to the future.

Our paper supports the philosophy in Donoho [18, Section 10] that simple and homely methods are just as good as more charismatic methods in Machine Learning for analyzing (real) high dimensional data.

APPENDIX A: AN EXTENSION OF THE ARW MODEL 

We consider an extension of the Asymptotic Rare and Weak (ARW) model in Section 1.2, where Models (1.1)-(1.2) and the calibration (1.8) continue to hold but (1.7) is replaced by a more sophisticated signal configuration:

(A.1) $\mu(j) \stackrel{iid}{\sim} (1 - \epsilon)\nu_0 + a \cdot \epsilon \cdot \nu_{-\tau} + (1 - a) \cdot \epsilon \cdot \nu_{\tau}$, $1 \le j \le p$,

where $0 \le a \le 1/2$ is a constant. This extended model includes the original ARW as a special case with $a = 0$. In this extension, we allow the nonzero coordinates of the feature vector $\mu$ to have positive and negative signs. Due to such a change, we need to slightly modify the definition of the (normalized) Hamming distance for signal recovery: $\mathrm{Hamm}_p(\hat\mu, \alpha, \beta, \theta) = (p\epsilon_p)^{-1}\sum_{j=1}^p P(\mathrm{sgn}(\mu(j)) \ne \mathrm{sgn}(\hat\mu(j)))$. The loss functions for clustering and hypothesis testing remain the same.

When $0 < a < 1/2$, with high probability, the majority of the nonzero coordinates of $\mu$ are positive, and the performance of the four methods in Section 1.1 is not affected. Furthermore, the statistical limits and CTUB for all three problems continue to hold. For brevity, we omit the details.

The case of $a = 1/2$ is more delicate. In this case, the two aggregation methods turn out to be ineffective. In light of this, we introduce a variant of the Sparse Aggregation, where we cluster the $n$ subjects by

(A.2) $\hat\ell_N^{(sa)} = \mathrm{sgn}(X\hat\mu_N^{(sa)})$.

Here,

(A.3) $\hat\mu_N^{(sa)} = \mathrm{argmax}_{\mu \in \{-1,0,1\}^p : \|\mu\|_0 = N} \|X\mu\|_1$.

Also, we use $\hat\mu_N^{(sa)}$ to estimate the sign of $\mu$ (i.e., for signal recovery), and use the test statistic

(A.4) $T_N^{(sa)} = N^{-1/2}\|X\hat\mu_N^{(sa)}\|_1$

for hypothesis testing. Note that if we force $\mu(j) \in \{0,1\}$ in (A.3), then it reduces to the original Sparse Aggregation.

Fig 4. Top left: statistical limits for clustering (red), signal recovery (black), and hypothesis testing (blue); $s = p\epsilon_p$. Other three panels: CTUB (green) for clustering (top right), signal recovery (bottom left) and hypothesis testing (bottom right), respectively.

Remark. We have not found a variant of Simple Aggregation that both achieves the statistical limit and is computationally tractable. However, in the less sparse case, the classical PCA turns out to be already optimal.$^{19}$

We now present the statistical limits and CTUB for all three problems. They are different from the ones we present in the main paper [28]. First, we look at the statistical limits.

$\eta_\theta^{clu}(\beta) = \begin{cases} (1 + \theta - 2\beta)/4, & \beta < (1 - \theta)/2, \\ \theta/2, & (1 - \theta)/2 < \beta < (1 - \theta), \\ (1 - \beta)/2, & \beta > (1 - \theta). \end{cases}$

$\eta_\theta^{sig}(\beta) = \begin{cases} \theta/2, & \beta < (1 - \theta), \\ (1 + \theta - \beta)/4, & \beta > (1 - \theta). \end{cases}$

$\eta_\theta^{hyp}(\beta) = \begin{cases} (1 + \theta - 2\beta)/4, & \beta < (1 - \theta)/2, \\ \theta/2, & (1 - \theta)/2 < \beta < (1 - \theta), \\ (1 + \theta - \beta)/4, & \beta > (1 - \theta). \end{cases}$

$^{19}$ Classical PCA for hypothesis testing is to reject the null hypothesis when the leading singular value of $X$ is larger than $\sqrt{p} + \sqrt{n} + \log(p)$; for signal recovery it is as described in Section 3. However, for signal recovery, since we need to estimate not only the support but also the sign of $\mu$, we slightly modify it to $\hat\mu(j) = \mathrm{sgn}(y(j)) \cdot 1\{|y(j)| > 2\sqrt{\log(p)}\}$, where $y = X'\hat\ell$ and $\hat\ell$ denotes the class label vector estimated by classical PCA.

Figure 4 (top left panel) displays the statistical limits for three problems. Comparing it with Figure 2 (top left panel), we find that: (a) the black curve (signal recovery) remains the same, (b) the red curve (clustering) remains the same, except that the segment on the left is replaced by $\tau^4 = p/(ns^2)$, (c) for the blue curve (hypothesis testing), the right most segment remains the same, while the other two segments coincide with those of the red curve.

Achievability. The statistical limit of clustering is achieved by the classical PCA (the left segment) and the variant (A.2) of Sparse Aggregation (the right two segments). For signal recovery, the right two segments are achieved by the modified Sparse Aggregation (A.3), and the left segment is achieved by classical PCA. For hypothesis testing, the left segment is achieved by classical PCA and the right two segments are achieved by the modified Sparse Aggregation (A.4).

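Next, we present a CTUB for each of the three problems:

$\tilde\eta_\theta^{clu}(\beta) = \begin{cases} (1 + \theta - 2\beta)/4, & \beta < 1/2, \\ \theta/4, & 1/2 < \beta < 1 - \theta/2, \\ (1 - \beta)/2, & \beta > 1 - \theta/2. \end{cases}$

$\tilde\eta_\theta^{sig}(\beta) = \begin{cases} \theta/2, & \beta < (1 - \theta)/2, \\ (1 + \theta - 2\beta)/4, & (1 - \theta)/2 < \beta < 1/2, \\ \theta/4, & \beta > 1/2. \end{cases}$

$\tilde\eta_\theta^{hyp}(\beta) = \begin{cases} (1 + \theta - 2\beta)/4, & \beta < 1/2, \\ \theta/4, & \beta > 1/2. \end{cases}$

See Figure 4 (top right and the two bottom panels).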

Methods associated with CTUB. The CTUB for clustering is associated with the methods of classical PCA (left segment) and IF-PCA (right two segments). The CTUB for signal recovery is associated with the methods of classical PCA (left segment) and IF-PCA (right segment). The CTUB for hypothesis testing is associated with the methods of classical PCA (left segment) and IF-PCA (right segment).

Remark. We now make a connection to the recent literature on the Gaussian mixture learning (e.g., [7, 14]). In our framework, we calibrate with $(\epsilon, \tau)$. In the latter, we calibrate with $\|\mu\|$ and $\|\mu\|_0$. For brevity, we only discuss the problem of hypothesis testing. The statistical limits for hypothesis testing can be (roughly) re-stated as follows:

• $\sqrt{np} \ll s \ll p$: $s\tau^2 = \sqrt{p/n}$.

• $n \ll s \ll \sqrt{np}$: $\tau = n^{-1/2}$.

• $s \ll n$: $\tau = (sn)^{-1/4}$.

Note that the first item corresponds to the non-sparse cases in the Gaussian mixture learning literature, where $\|\mu\|^2 = s\tau^2 = \sqrt{p/n}$; the results match with those in, e.g., [7, 14]. The second one is part of the sparse case in the Gaussian mixture learning literature, where $1 \ll \|\mu\|^2 = s/n \ll \sqrt{p/n}$ and $n \ll \|\mu\|_0 \ll \sqrt{np}$. The last one is also part of the sparse case, where $\|\mu\|^2 = \sqrt{s/n}$ and $s \ll n$.
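The three regimes above can be collected into one small function. The sketch below is an illustration of the restated limits, with multi-logarithmic factors ignored as in the text; the function name is hypothetical, and it assumes $n < p$ so that the middle regime is non-empty.

```python
import math

def critical_tau(n, p, s):
    """Critical signal strength tau for hypothesis testing, up to multi-log factors.

    n: number of subjects, p: number of features, s: number of useful features.
    """
    if s >= math.sqrt(n * p):          # non-sparse regime: s * tau^2 = sqrt(p/n)
        return (p / (n * s * s)) ** 0.25
    if s >= n:                         # moderately sparse regime: tau = n^{-1/2}
        return n ** -0.5
    return (s * n) ** -0.25            # very sparse regime: tau = (s*n)^{-1/4}

n, p = 100, 10000
for s in (10, 100, 500, 1000, 4000):
    tau = critical_tau(n, p, s)
    print(s, tau, s * tau ** 2)        # last column is ||mu||^2 = s * tau^2
```

The three pieces agree at the regime boundaries $s = n$ and $s = \sqrt{np}$, which is consistent with the continuity of the boundary curve in the phase diagram.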

APPENDIX B: PROOF OF LEMMAS IN SECTION 2 

In this section, we prove the post-selection random matrix theory results in Section 2, specifically Lemmas 2.1-2.4.

B.1. Preliminary lemmas for Section 2. Lemma B.1 states the well-known Bernstein inequality [41]. Lemma B.2 is a result from classical Random Matrix Theory [45, Page 21]. Lemma B.3 states some properties about columns of the matrix $Z^{(q)}$; it is proved in Section D.1.

Lemma B.1 Let $X_1, \ldots, X_N$ be independent random variables with $E[X_k] = 0$ and $\mathrm{var}(X_k) \le V_k$, for $1 \le k \le N$. Suppose $E(|X_k|^m) \le V_k m! c^{m-2}/2$ for all $m \ge 2$, where $c > 0$ is a constant. Then for all $\lambda > 0$,

$P(N^{-1/2}|\sum_{k=1}^N X_k| \ge \lambda) \le 2\exp\left(-\frac{\lambda^2/2}{\sum_{k=1}^N V_k/N + c\lambda/\sqrt{N}}\right)$.

Lemma B.2 Let $A$ be an $N \times n$ matrix whose entries are independent standard normal random variables. Then for every $x \ge 0$, with probability at least $1 - 2\exp(-x^2/2)$,

$\sqrt{N} - \sqrt{n} - x \le s_{\min}(A) \le s_{\max}(A) \le \sqrt{N} + \sqrt{n} + x$,

where $s_{\min}(A)$ and $s_{\max}(A)$ are the respective minimum and maximum singular values of $A$.

Fix $q > 0$. With $e_1 = (1, 0, \ldots, 0)'$ and $z \sim N(0, I_n)$, we introduce a few notations:

$\pi_0^{(q)} = P(\|z\|^2 > n + 2\sqrt{qn\log(p)})$,

$\pi_1^{(q)} = P(\|z + \sqrt{n}\tau_p e_1\|^2 > n + 2\sqrt{qn\log(p)})$,

$a_p^{(q)} = E[(z(1))^2 \cdot 1\{\|z\|^2 > n + 2\sqrt{qn\log(p)}\}]$,

$b_p^{(q)} = E[(z(2))^2 \cdot 1\{\|z + \sqrt{n}\tau_p e_1\|^2 > n + 2\sqrt{qn\log(p)}\}]$,

$c_p^{(q)} = E[(z(1))^2 \cdot 1\{\|z + \sqrt{n}\tau_p e_1\|^2 > n + 2\sqrt{qn\log(p)}\}]$.

For notational simplicity, we omit all the superscripts. In the following lemma, $m^{(q)}(\ell, \mu)$, $m_1^{(q)}(\ell, \mu)$ and the event $D_p$ are defined in Section 2.

Lemma B.3 Let $S(\mu)$ denote the support of $\mu$, $\kappa_m = E(|z(1)|^m)$ and $\kappa_{2m}(n) = E(\|z\|^{2m})$, where $z \sim N(0, I_n)$. Below, all the probabilities are conditioning on $(\ell, \mu)$, and the $o(1)$ terms are uniform for all realizations of $(\ell, \mu)$ in the event $D_p$.

(a) Fix $j \notin S(\mu)$. For any $v \in S^{n-1}$ and any integer $m \ge 1$,

$E[(v'z_j^{(q)})^2] = a_p$,

$E(|v'z_j^{(q)}|^m) \le \kappa_m\pi_0(1 + o(1))$,

$E(\|z_j^{(q)}\|^2) = na_p$,

$E(\|z_j^{(q)}\|^{2m}) = \kappa_{2m}(n)\pi_0(1 + o(1))$,

$a_p = \pi_0(1 + L_pn^{-1/2})$.

(b) Fix $j \in S(\mu)$. For any $v \in S^{n-1}$ and any integer $m \ge 1$,

$E[(v'z_j^{(q)})^2] = b_p + (c_p - b_p)(v'\ell)^2/n$,

$E(|v'z_j^{(q)}|^m) \le \kappa_m\pi_1(1 + o(1))$,

$E(\|z_j^{(q)}\|^2) = nb_p + (c_p - b_p)$,

$E(\|z_j^{(q)}\|^{2m}) \le 2^m\kappa_{2m}(n)\pi_1(1 + o(1))$,

$b_p = \pi_1(1 + L_pn^{-1/4})$, $c_p = \pi_1(1 + L_pn^{-1/4})$.

(c) $m^{(q)}(\ell, \mu) = (p - |S(\mu)|)\pi_0 + |S(\mu)|\pi_1$,

$m_1^{(q)}(\ell, \mu) = (p - |S(\mu)|)a_p + |S(\mu)|[b_p + n^{-1}(c_p - b_p)]$.

Remark. Lemma B.3 allows us to characterize the quantities $m^{(q)}$ and $m_1^{(q)}$. First, by (a)-(b), $a_p \sim \pi_0$ and $b_p \sim c_p \sim \pi_1$. Combining them with (c) gives that $m_1^{(q)} \sim m^{(q)}$. Second, we look at $m^{(q)}$. By Lemma D.1 and Mills' ratio [41], $\pi_0 \sim \bar\Phi(\sqrt{2q\log(p)}) = L_pp^{-q}$. Similarly, $\pi_1 \sim L_pp^{-[(\sqrt{q} - \sqrt{r})_+]^2}$. Plugging them into (c) gives

$m^{(q)} \sim L_pp^{1-q} + p\epsilon_p \cdot L_pp^{-[(\sqrt{q} - \sqrt{r})_+]^2}$.

This is the equation (2.1) in the main text of [30].
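The statement $\bar\Phi(\sqrt{2q\log(p)}) = L_pp^{-q}$ can be illustrated numerically: by Mills' ratio, the normal tail at $x = \sqrt{2q\log(p)}$ is close to $\phi(x)/x = p^{-q}/(2\sqrt{\pi q\log(p)})$, so the multi-$\log(p)$ factor is explicit. The sketch below is only an illustration of this approximation; the function names and the values of $p$ and $q$ are assumptions.

```python
import math

def normal_sf(x):
    """P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def null_tail(p, q):
    """Normal tail at the screening level sqrt(2*q*log(p))."""
    return normal_sf(math.sqrt(2.0 * q * math.log(p)))

def mills_approx(p, q):
    """phi(x)/x at x = sqrt(2*q*log(p)), i.e. p^{-q} / (2*sqrt(pi*q*log(p)))."""
    return p ** (-q) / (2.0 * math.sqrt(math.pi * q * math.log(p)))

for p in (1e3, 1e6, 1e9):
    for q in (0.5, 1.0, 2.0):
        print(p, q, null_tail(p, q), mills_approx(p, q))
```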


B.2. Proof of Lemma 2.1. Fix $q > 0$ and write $H_0 = Z^{(q)}(Z^{(q)})'$ and $\hat S = \hat S_q$ for short. Fix a realization $(\ell, \mu)$. With probability at least $1 - O(p^{-3})$,

(B.5) $||\hat S| - m^{(q)}| \le \sqrt{6m^{(q)}\log(p)}$.

First, we consider $q > q(\beta, \theta, r)$, so that $m^{(q)} \le np^{-\delta}$ for some $\delta > 0$. Let $k = \lceil m^{(q)} + \sqrt{6m^{(q)}\log(p)}\rceil$. Under (B.5),

$\lambda_{\max}(H_0) \le \max_{T \subset \{1,\ldots,p\}, |T| \le k} \lambda_{\max}((Z'Z)^{T,T})$, $\lambda_{\min}^+(H_0) \ge \min_{T \subset \{1,\ldots,p\}, |T| \le k} \lambda_{\min}^+((Z'Z)^{T,T})$,

where for a matrix $A$, $\lambda_{\min}^+(A)$ denotes the minimum non-zero eigenvalue and $A^{T,T}$ is the submatrix restricted to rows and columns in $T$. For each fixed $T$, we can write $(Z'Z)^{T,T} = Z_T'Z_T$, where $Z_T = (z_j, j \in T)$ is an $n \times |T|$ matrix with iid entries of $N(0,1)$. Using Lemma B.2, for each $T$, with probability at least $1 - O(p^{-(k+3)})$, all non-zero eigenvalues of $(Z'Z)^{T,T}$ fall into

$[(\sqrt{n} - \sqrt{|T|} - \sqrt{6k\log(p)})^2, (\sqrt{n} + \sqrt{|T|} + \sqrt{6k\log(p)})^2] = n \pm C\sqrt{nk\log(p)}$.

Note that the number of subsets $T$ such that $|T| \le k$ is no more than $p^k$. Combining the above results, we find that with probability at least $1 - O(p^{-3})$, all non-zero eigenvalues of $H_0$ fall into

(B.6) $n \pm C\sqrt{nm^{(q)}\log(p)}$.

The claim then follows.

Next, we consider $q < q(\beta, \theta, r)$, so that $m^{(q)} \ge np^{\delta}$ for some $\delta > 0$. Write for short

$\omega_p = \sqrt{nm^{(q)}} + o(1)|S(\mu)|\pi_1$,

where 7Ti is as in Lemma B.3 and rn^ = |S'(//)|7Ti by definition. It suffices 
to show that with probability at least $1 - O(p^{-3})$,

(B.7) $\|H_0 - m_1^{(q)}I_n\| \le C\omega_p$.
(B.7) 
We now show (B.7). Fix $\alpha > 0$. A subset $\mathcal{M}_\alpha$ of the unit sphere $S^{n-1}$ is called an $\alpha$-net if for any $v \in S^{n-1}$, there exists $u \in \mathcal{M}_\alpha$ such that $\|u - v\| \le \alpha$. The following lemma states some well-known results and its proof can be found in [45, Page 8].

Lemma B.4 Fix $\alpha \in (0, 1/2)$. For any $\mathcal{M}_\alpha$, an $\alpha$-net of $S^{n-1}$, and any symmetric matrix $A \in R^{n,n}$, $\|A\| \le (1 - 2\alpha)^{-1}\sup_{u \in \mathcal{M}_\alpha}\{|u'Au|\}$. Moreover, there exists an $\alpha$-net $\mathcal{M}_\alpha^*$ of $S^{n-1}$ such that $|\mathcal{M}_\alpha^*| \le (1 + 2/\alpha)^n$.

By Lemma B.4 with $\alpha = 1/4$, there exists a subset $\mathcal{M}^*$ such that $|\mathcal{M}^*| \le 9^n$ and $\sup_{v \in \mathcal{M}^*}|v'Av| \ge \|A\|/2$ for any $n \times n$ symmetric matrix $A$. Therefore, to show the claim, it suffices to show that for each fixed $v \in \mathcal{M}^*$, with probability $\ge 1 - O(9^{-n}p^{-3})$,

(B.8) $|v'(H_0 - m_1^{(q)}I_n)v| \le C\omega_p$.

We now show (B.8). Fix $v$ and define

$W_j = (v'z_j^{(q)})^2 - a_p$, for $j \notin S(\mu)$; $\qquad W_j = (v'z_j^{(q)})^2 - b_p$, for $j \in S(\mu)$,

where $a_p$ and $b_p$ are defined in Section B.1. By (c) of Lemma B.3, $m_*^{(q)} = (p - |S(\mu)|)a_p + |S(\mu)|b_p + n^{-1}(c_p - b_p)|S(\mu)|$. Since $|c_p - b_p| = o(\pi_1)$, we can
rewrite

(B.9) $v'(H_0 - m_*^{(q)} I_n)v = \sum_{j=1}^p W_j + o(n^{-1}|S(\mu)|\pi_1)$.

Here the $W_j$'s are independent of each other. Applying Lemma B.3, we get the
following results. For $j \notin S(\mu)$, $E(W_j) = 0$, $\mathrm{var}(W_j) \le 3\pi_0(1 + o(1))$ and
$E(|W_j|^m) \le \kappa_{2m}\pi_0(1 + o(1))$. For $j \in S(\mu)$, $|E(W_j)| \le |b_p - c_p| = \pi_1 \cdot o(1)$,
$\mathrm{var}(W_j) \le 3\pi_1(1 + o(1))$ and $E(|W_j|^m) \le \kappa_{2m}\pi_1(1 + o(1))$. So we have

$|\sum_{j=1}^p E(W_j)| = o(1)|S(\mu)|\pi_1, \qquad \sum_{j=1}^p \mathrm{var}(W_j) \le 3m^{(q)}$.


We apply Lemma B.1 with $\lambda = \sqrt{9p^{-1}m^{(q)}(n\log(9) + 2\log(p) + \log(2))}$. To
check the moment conditions, we note that $\kappa_{2m} = E_{z \sim N(0,1)}(|z|^{2m}) \le 2^m m!$
for all $m \ge 1$. Furthermore, since $m^{(q)}/n \to \infty$, we have $\sum_j \mathrm{var}(W_j)/p \sim 3m^{(q)}/p \gg \lambda/\sqrt{p}$. It follows that with probability $\ge 1 - O(9^{-n}p^{-3})$,

$|\sum_{j=1}^p W_j| \le 3\sqrt{\log(9)}\sqrt{n m^{(q)}} + o(1)|S(\mu)|\pi_1$.

This gives (B.8), and the proof is now complete. □
B.3. Proof of Lemma 2.2. We have shown the first claim in (B.7),
noting that $\omega_p \sim \sqrt{n m^{(q)}}$ when $r < \rho_\theta(\beta)$ (see also (B.15)).

We now show the second claim. Write for short $H_0 = Z^{(q)}(Z^{(q)})'$. The key
is the following lemma, which is proved in Section D.

Lemma B.5 Under conditions of Lemma 2.2, as $p \to \infty$, conditioning on
any realization of $(\ell, \mu)$ on the event $D_p$, with probability at least $1 - O(n^{-2})$,

$|n^{-1}\mathrm{tr}(H_0) - m_*^{(q)}| \le C\sqrt{m^{(q)}\log(p)}$,

$\|H_0\|_F^2 \ge n^{-1}[\mathrm{tr}(H_0)]^2 + Cn^2 m^{(q)}$.

Let $k$ be the largest integer that is no larger than $m^{(q)}/2$. Since $k \gg n$,
for each fixed $n \times k$ submatrix of $Z$, its rank is $n$ with probability 1.
Using (B.5), the rank of $H_0$ is $n$ with probability at least $1 - O(p^{-3})$. Let
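$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n > 0$ be the eigenvalues of $H_0$ and write $\bar\lambda = n^{-1}\sum_{i=1}^n \lambda_i$.
For $\delta_p = \sqrt{n m^{(q)}}$, (B.7) and Lemma B.5 imply

(B.10) $|\lambda_1 - \lambda_n| \le A_1\delta_p, \qquad \bar\lambda = m_*^{(q)} + o(\delta_p), \qquad \sum_{i=1}^n \lambda_i^2 \ge n\bar\lambda^2 + A_2 n\delta_p^2$,

for some constants $A_1, A_2 > 0$. On one hand,

$\sum_{i=1}^n (\lambda_i - \bar\lambda)^2 \ge A_2 n\delta_p^2$.

On the other hand, $\lambda_i - \bar\lambda \le \lambda_1 - \bar\lambda$ for $i$ satisfying $\lambda_i > \bar\lambda$; moreover,
$\bar\lambda - \lambda_i \le A_1\delta_p$ for $i$ such that $\lambda_i \le \bar\lambda$. Since $\sum_{i=1}^n (\lambda_i - \bar\lambda) = 0$, it follows that

$\sum_{i=1}^n (\lambda_i - \bar\lambda)^2 \le (\lambda_1 - \bar\lambda)\sum_{i: \lambda_i > \bar\lambda}(\lambda_i - \bar\lambda) + A_1\delta_p\sum_{i: \lambda_i \le \bar\lambda}(\bar\lambda - \lambda_i)$

$= [(\lambda_1 - \bar\lambda) + A_1\delta_p]\sum_{i: \lambda_i > \bar\lambda}(\lambda_i - \bar\lambda)$

$\le n[(\lambda_1 - \bar\lambda) + A_1\delta_p](\lambda_1 - \bar\lambda)$.

Now, if we write $x = \lambda_1 - \bar\lambda$, then $x(x + A_1\delta_p) \ge A_2\delta_p^2$. Solving this quadratic inequality for $x \ge 0$ gives

(B.11) $\lambda_1 - \bar\lambda \ge \frac{\sqrt{A_1^2 + 4A_2} - A_1}{2}\,\delta_p$.

Combining it with the second equation in (B.10), we obtain that $\lambda_1 \ge m_*^{(q)} + C\sqrt{n m^{(q)}}$.

□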
B.4. Proof of Lemma 2.3. Write for short $\Delta = \ell(Z^{(q)}\mu^{(q)})' + (Z^{(q)}\mu^{(q)})\ell'$.
Since

$\|\Delta\| \le 2\|\ell\|\|Z^{(q)}\mu^{(q)}\| \le 2n\|Z^{(q)}\mu^{(q)}\|_\infty$,

it suffices to show that with probability $1 - O(p^{-3})$,

(B.12) $\|Z^{(q)}\mu^{(q)}\|_\infty \le C\tau_p^*\sqrt{m_1^{(q)}}$.

Note that

(B.13) $\|Z^{(q)}\mu^{(q)}\|_\infty = \max_{1 \le i \le n}|\sum_{j \in S(\mu)} \mu(j) z_j^{(q)}(i)| = \tau_p^* \max_{1 \le i \le n}|\sum_{j \in S(\mu)} z_j^{(q)}(i)|$.

Fix $i$ and write $V_j = z_j^{(q)}(i)$ for short. Then the $V_j$'s are independent and $E(V_j) = 0$ by symmetry. We apply Lemma B.3 with $v = e_i$ and find that $\mathrm{var}(V_j) \le b_p + 2|c_p - b_p| = \pi_1(1 + o(1))$. By Lemma B.1 (the moment conditions can
be verified using Lemma B.3), $|\sum_{j \in S(\mu)} V_j| \le 2\sqrt{|S(\mu)|\pi_1}$ with probability
$1 - O(p^{-4})$. It follows that with probability $1 - O(p^{-3})$,

(B.14) $\max_{1 \le i \le n}|\sum_{j \in S(\mu)} z_j^{(q)}(i)| \le C\sqrt{|S(\mu)|\pi_1} = C\sqrt{m_1^{(q)}}$.

Combining (B.13)-(B.14) gives (B.12). □


B.5. Proof of Lemma 2.4. Introduce 

A \q,P,r,6) 

f P~ |min{g,/3- f}, q < r, 

l /3 + (v^-^) 2 -|nrin{g^-f+ (Vg-^) 2 }, 9 > r. 

By elementary algebra, 

r < Pe(P) min (q, /3,r, 6) > 1/2, 


Moreover, using Mills’ ratio, 

r;^S(^| = Lp V 1 / 2 -A t ( ? ,An0)_ 

It follows that for some $\delta > 0$,

(B.15) $r < \rho_\theta(\beta) \implies |S(\mu)|\pi_1 \le p^{-\delta}\max\{\sqrt{m^{(q)}}, \sqrt{n}\}$.
Consider the first claim. The proof is similar to that of (B.8), except that
we take $\lambda = C\sqrt{p^{-1}m^{(q)}\log(p)}$ when applying Lemma B.1. It follows that
for any $v \in S^{n-1}$, with probability at least $1 - O(p^{-3})$,

$|v'(H_0 - m_*^{(q)} I_n)v| \le C\sqrt{m^{(q)}\log(p)} + o(|S(\mu)|\pi_1)$.

By (B.15), the second term above is negligible and the claim follows.

Consider the second claim. $H - H_0 = \|\mu^{(q)}\|^2\ell\ell' + \Delta$, where with probability
at least $1 - O(p^{-3})$, $\|\Delta\| \le Cn\tau_p^*\sqrt{m_1^{(q)}}$ by Lemma 2.3. Furthermore,
by elementary statistics, $\|\mu^{(q)}\|^2 \le Cm_1^{(q)}(\tau_p^*)^2$. Note that $m_1^{(q)} = |S(\mu)|\pi_1$
due to the spherical symmetry of $N(0, I_n)$. Together, we see that

$\|H - H_0\| \le L_p(\sqrt{n}|S(\mu)|\pi_1 + n^{3/4}\sqrt{|S(\mu)|\pi_1})$.

If $q > q(\beta, r, \theta)$, then $m^{(q)} = o(n)$ and (B.15) implies $\|H - H_0\| \le p^{-\delta}n$. If
$q < q(\beta, r, \theta)$, then $n = o(m^{(q)})$ and (B.15) implies $\|H - H_0\| \le p^{-\delta}\sqrt{n m^{(q)}}$.

□ 

APPENDIX C: PROOF OF LEMMAS IN SECTION 3 
In this section, we prove Lemmas 3.1-3.3. 

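C.1. Proof of Lemma 3.1. We first show the claim for $B = I_p$ and
then generalize it to any $B$ satisfying $\max\{\|B\|, \|B^{-1}\|\} \le L_p$.

Fix $B = I_p$. We use $\delta > 0$ to denote a generic constant which only depends
on $(\alpha, \beta, \theta)$ but may change from occurrence to occurrence. In our model,
$X = \ell\mu' + Z$. Let $H_0 = ZZ' - pI_n$. It is seen that

(C.16) $XX' - pI_n = [\|\mu\|^2\ell\ell' + \ell\mu'Z' + Z\mu\ell'] + ZZ' - pI_n = [\|\mu\|^2\ell\ell' + \ell\mu'Z' + Z\mu\ell'] + H_0$.

Since $\hat\xi$ is a left singular vector of $X$, with $\lambda$ the corresponding eigenvalue of $XX' - pI_n$, we have $\lambda\hat\xi = [\|\mu\|^2(\ell, \hat\xi) + (\hat\xi, Z\mu)]\ell + (\hat\xi, \ell)Z\mu + H_0\hat\xi$. Rearranging it, we have

(C.17) $\sqrt{n}\hat\xi = (I_n - (1/\lambda)H_0)^{-1}[b_1\ell + b_2 Z(\mu/\|\mu\|)]$,

where $b_1 = b_1(\hat\xi, Z, \mu) = (1/\lambda)\cdot[\sqrt{n}\|\mu\|^2(\ell, \hat\xi) + \sqrt{n}(\hat\xi, Z\mu)]$ and $b_2 = b_2(\hat\xi, Z, \mu) = (1/\lambda)\sqrt{n}\|\mu\|(\ell, \hat\xi)$. Therefore, $\min\{\|\sqrt{n}\hat\xi - \ell\|_\infty, \|\sqrt{n}\hat\xi + \ell\|_\infty\}$ is no greater
than

(C.18) $\min\{|b_1 - 1|, |b_1 + 1|\} + |b_1|\,\|\ell - (I_n - (1/\lambda)H_0)^{-1}\ell\|_\infty + |b_2|\,\|(I_n - (1/\lambda)H_0)^{-1}Z(\mu/\|\mu\|)\|_\infty$.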
To show the claim, it is sufficient to show that with probability at least
$1 - o(p^{-3})$,

(C.19) $\min\{|b_1 - 1|, |b_1 + 1|\} \le p^{-\delta}, \qquad |b_2| \le p^{-\delta}$,

and

(C.20) $\|\ell - (I_n - (1/\lambda)H_0)^{-1}\ell\|_\infty \le p^{-\delta}, \qquad \|(I_n - (1/\lambda)H_0)^{-1}Z(\mu/\|\mu\|)\|_\infty \le C\sqrt{\log(p)}$.

We now show (C.19). Consider the first item. Since $Z$ and $\mu$ are independent, we have that with probability at least $1 - o(p^{-3})$, $|(\hat\xi, Z\mu)| \le \|\mu\| \cdot \|Z(\mu/\|\mu\|)\| \le 2\|\mu\| \cdot \sqrt{n}$. Combining this with the triangle inequality,

$\min\{|b_1 - 1|, |b_1 + 1|\}$

$\le (n\|\mu\|^2/\lambda)|\cos(\hat\xi, \ell) - 1| + |1 - (n\|\mu\|^2/\lambda)| + (\sqrt{n}/\lambda)|(\hat\xi, Z\mu)|$

(C.21) $\le (n\|\mu\|^2/\lambda)|\cos(\hat\xi, \ell) - 1| + |1 - (n\|\mu\|^2/\lambda)| + 2n\|\mu\|/\lambda$.

At the same time, we rewrite (C.16) as

(C.22) $XX' - pI_n = A + H_0$, where $A = \|\mu\|^2\ell\ell' + \ell\mu'Z' + Z\mu\ell'$ for short.

Note that $A$ is a symmetric matrix of rank 2. For short, write $\nu = \|\mu\|^{-2}\mu$
and $a = a(\ell, \mu, Z) = (1 + 4n^{-1}[\ell'Z\nu + \|Z\nu\|^2])^{1/2}$. Let $\lambda_\pm$ be the two nonzero
eigenvalues of $A$, and let $\eta_\pm$ be the corresponding eigenvectors. By elementary algebra,

(C.23) $\lambda_\pm(A) = n\|\mu\|^2[(1/2)(1 \pm a) + n^{-1}\ell'Z\nu], \qquad \eta_\pm \propto (1/2)(1 \pm a)\ell + Z\nu$.
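
The closed form (C.23) can be checked numerically on a toy instance. The following sketch (pure Python, with made-up values for $\ell$, $\mu$ and $Z$ that are assumptions for illustration only) applies $A$ to the claimed eigenvectors and records the residual $\|A\eta_\pm - \lambda_\pm\eta_\pm\|_\infty$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(M, x):
    return [dot(row, x) for row in M]

# Hypothetical small instance: n = 4 samples, p = 3 features.
ell = [1.0, -1.0, 1.0, 1.0]            # class labels, so ||ell||^2 = n
mu = [0.5, -1.0, 2.0]                  # feature vector
Z = [[0.3, -1.2, 0.7],
     [1.1, 0.4, -0.5],
     [-0.8, 0.9, 0.2],
     [0.6, -0.3, 1.5]]                 # stand-in for the noise matrix

n = len(ell)
mu_norm2 = dot(mu, mu)
nu = [m / mu_norm2 for m in mu]        # nu = mu / ||mu||^2
w = matvec(Z, nu)                      # w = Z nu
c, d = dot(ell, w), dot(w, w)          # ell' Z nu and ||Z nu||^2
a = math.sqrt(1.0 + 4.0 * (c + d) / n)

def apply_A(x):
    """Apply A = ||mu||^2 (ell ell' + ell w' + w ell') to the vector x."""
    lx, wx = dot(ell, x), dot(w, x)
    return [mu_norm2 * (ell[i] * (lx + wx) + w[i] * lx) for i in range(n)]

residuals = []
for sign in (1.0, -1.0):
    lam = n * mu_norm2 * (0.5 * (1.0 + sign * a) + c / n)        # eigenvalue in (C.23)
    eta = [0.5 * (1.0 + sign * a) * ell[i] + w[i] for i in range(n)]  # eigenvector in (C.23)
    A_eta = apply_A(eta)
    residuals.append(max(abs(A_eta[i] - lam * eta[i]) for i in range(n)))
print(residuals)
```

The identity is purely algebraic, so the residuals should vanish up to floating-point rounding for any choice of inputs with $\|\ell\|^2 = n$.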

By elementary statistics, it is seen that with probability at least $1 - o(p^{-3})$,
$n^{-1}[|\ell'Z\nu| + \|Z\nu\|^2]$ does not exceed

(C.24) $C\sqrt{\log(p)}\,n^{-1}[(\sqrt{n}\|\mu\|^{-1}) + n\|\mu\|^{-2}] = C\sqrt{\log(p)}[(\sqrt{n}\|\mu\|)^{-1} + \|\mu\|^{-2}]$.

Note that for $(\alpha, \beta, \theta)$ in our range of interest, $p\epsilon_p\tau_p^2 \gg \sqrt{p/n} = p^{(1-\theta)/2}$. By
the way $\mu$ is generated, $\|\mu\|^2 \sim p\epsilon_p\tau_p^2$. Therefore, with probability at least
$1 - o(p^{-3})$,

(C.25) $\|\mu\|^2 \sim p\epsilon_p\tau_p^2 \ge p^{\delta}\sqrt{p/n}, \qquad n\|\mu\|^2 \sim np\epsilon_p\tau_p^2 \ge p^{\delta}\sqrt{pn}$.

Inserting (C.25) into (C.24) gives that with probability at least $1 - o(p^{-3})$,
$|a - 1| \le Cp^{-\delta}$. Combining this with (C.23),

(C.26) $|(n\|\mu\|^2/\lambda_+) - 1| \le p^{-\delta}, \qquad (\lambda_-/\lambda_+) \le p^{-\delta}, \qquad |\cos(\ell, \eta_+) - 1| \le p^{-\delta}$.
At the same time, by a direct use of elementary Random Matrix Theory
[45], $\|H_0\| = \|ZZ' - pI_n\| \le C\sqrt{pn}$. Combining these with (C.25)-(C.26)
gives

(C.27) $\|(1/\lambda_+)H_0\| \le C\sqrt{pn}/(n\|\mu\|^2) \le Cp^{-\delta}$.

This says that in (C.22), the leading eigenvalue of $A$ is larger than that of $H_0$
by a factor of $p^{\delta}$. By matrix perturbation theory, we have that with probability
at least $1 - o(p^{-3})$,

(C.28) $|\lambda_+/\lambda - 1| \le p^{-\delta}, \qquad |\cos(\eta_+, \hat\xi) - 1| \le p^{-\delta}$.

Combining (C.26) and (C.28) gives

(C.29) $|(n\|\mu\|^2/\lambda) - 1| \le p^{-\delta}, \qquad |\cos(\hat\xi, \ell) - 1| \le p^{-\delta}$.

In particular, combining (C.25), (C.27), and (C.28) gives that with probability at least $1 - o(p^{-3})$,

(C.30) $\|(1/\lambda)H_0\| \le Cp^{-\delta}, \qquad \sqrt{pn}/\lambda \le p^{-\delta}$.

Inserting (C.29) into (C.21) gives the first item of (C.19).

Consider the second item of (C.19). Note that $|b_2| \le (n\|\mu\|/\lambda)$, where by
(C.29), the right hand side $\lesssim \|\mu\|^{-1}$. The claim follows directly from (C.25).

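We now show (C.20). Since the proofs are similar, we only show the first
item. Let $e_1$ be the first standard basis vector of $R^n$. Note that by symmetry and by
using the union bound, it is sufficient to show that with probability at least
$1 - o(p^{-4})$,

(C.31) $|e_1'(I_n - \frac{1}{\lambda}H_0)^{-1}e_1 - 1| \le p^{-\delta}, \qquad |e_1'(I_n - \frac{1}{\lambda}H_0)^{-1}(\ell - \ell_1 e_1)| \le C\sqrt{\log(p)}$.

The first claim follows easily by (C.30) and basic algebra. For the second
claim, write $\ell = (\ell_1, \tilde\ell')'$, let $\tilde Z$ be the $(n - 1) \times p$ matrix consisting of all
but the first row of $Z$, write $Z_1'$ for the first row of $Z$, and let $\tilde H_0 = \tilde Z\tilde Z' - pI_{n-1}$. It follows that

$I_n - (1/\lambda)H_0 = \begin{pmatrix} 1 - (1/\lambda)[\|Z_1\|^2 - p] & -(1/\lambda)Z_1'\tilde Z' \\ -(1/\lambda)\tilde Z Z_1 & I_{n-1} - (1/\lambda)\tilde H_0 \end{pmatrix}$,

and

(C.32) $e_1'(I_n - \frac{1}{\lambda}H_0)^{-1}(\ell - \ell_1 e_1) = (e_1'[I_n - (1/\lambda)H_0]^{-1}e_1)\,(1/\lambda)Z_1'\tilde Z'[I_{n-1} - (1/\lambda)\tilde H_0]^{-1}\tilde\ell$.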
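Now, since rows of $Z$ are independent, $Z_1$ and $\tilde Z'[I_{n-1} - (1/\lambda)\tilde H_0]^{-1}\tilde\ell$ are two
vectors that are almost independent of each other; the only issue is that $Z_1$ is
correlated with $\lambda$. To overcome the difficulty, we write

(C.33) $\frac{1}{\lambda}Z_1'\tilde Z'[I_{n-1} - \frac{1}{\lambda}\tilde H_0]^{-1}\tilde\ell = \sum_{k=0}^\infty \frac{Z_1'\tilde Z'\tilde H_0^k\tilde\ell}{\lambda^{k+1}} = \sum_{k=0}^\infty \frac{\|\tilde Z'\tilde H_0^k\tilde\ell\|}{\lambda^{k+1}} \cdot \frac{Z_1'\tilde Z'\tilde H_0^k\tilde\ell}{\|\tilde Z'\tilde H_0^k\tilde\ell\|}$.

Now, for each $k$, $Z_1$ and $\tilde Z'\tilde H_0^k\tilde\ell$ are independent, and so
$Z_1'(\tilde Z'\tilde H_0^k\tilde\ell/\|\tilde Z'\tilde H_0^k\tilde\ell\|) \sim N(0, 1)$.

For the $k$-th term, with probability $1 - o(p^{-4(k+1)})$, $|Z_1'(\tilde Z'\tilde H_0^k\tilde\ell/\|\tilde Z'\tilde H_0^k\tilde\ell\|)| \le \sqrt{8(k + 1)\log(p)}$. Additionally, by basics in RMT [45], with probability at
least $1 - o(p^{-4})$, $\|\tilde Z'\tilde H_0^k\| \le \sqrt{p}(C\sqrt{np})^k$ for all $k$. Hence

$(1/\lambda)^{k+1}\|\tilde Z'\tilde H_0^k\tilde\ell\| \le \sqrt{n - 1}\,(1/\lambda)^{k+1}\|\tilde Z'\tilde H_0^k\| \le (C\sqrt{np}/\lambda)^{k+1}$.

Combining these with (C.33) and the second term of (C.30), it is seen that
with probability at least $1 - o(p^{-4})$,

$|(1/\lambda)Z_1'\tilde Z'[I_{n-1} - (1/\lambda)\tilde H_0]^{-1}\tilde\ell| \le p^{-\delta}$.

Inserting this into (C.32) and using the first item of (C.31), the second item
of (C.31) follows.

For a general $B$, the proof is similar by noting that $\|ZB\| \le L_p\|Z\|$ and
the following lemma, which is proved below.

Lemma C.1 As $n, p \to \infty$ and $p/n \to \infty$, for an $n \times p$ random matrix $Z$ where $Z(i, j) \overset{iid}{\sim} N(0, 1)$ and any non-random matrix $B \in R^{p,p}$
such that $\max\{\|B\|, \|B^{-1}\|\} \le L_p$, with probability $1 - O(p^{-3})$, $\|ZBB'Z' - \mathrm{tr}(BB')I_n\| \le C\sqrt{np}$.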

C.2. Proof of Lemma 3.2. Letting $\Phi$ be the CDF of $N(0, 1)$ and $\phi$ its density, denote
the mean and variance of $|z_i + h|$ by $u(h)$ and $\sigma^2(h)$, respectively. It is seen
that

(C.34) $u(h) = \sqrt{2/\pi}\,e^{-h^2/2} + h[1 - 2\Phi(-h)], \qquad \sigma^2(h) = 1 + h^2 - u^2(h)$.

By Jensen's inequality, $E|z_i + h| \ge |E(z_i + h)| = h$. It follows that

(C.35) $u(h) \ge h, \qquad \sigma^2(h) \le 1$.
At the same time, we claim that as $n \to \infty$, for any $0 < x \le \sqrt{n}/\log(n)$,

(C.36) $P(|\sum_{i=1}^n (|z_i + h| - u(h))| \ge \sqrt{n}x) \le 2\exp(-(1 + o(1))\frac{x^2}{2\sigma^2(h)})$,

where $o(1) \to 0$ as $n \to \infty$, uniformly for all $h > 0$ and $0 < x \le \sqrt{n}/\log(n)$.
Combining (C.35) and (C.36) gives Lemma 3.2.
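
The closed form (C.34) and the bounds (C.35) are easy to confirm numerically. The sketch below (function names and the grid of $h$ values are assumptions chosen for illustration) compares $u(h)$ with a midpoint-rule approximation of $E|z + h|$ and evaluates $\sigma^2(h)$.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def u(h):
    """Closed form for E|z + h| with z ~ N(0, 1), as in (C.34)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h) + h * (1.0 - 2.0 * Phi(-h))

def sigma2(h):
    """Variance of |z + h|, as in (C.34)."""
    return 1.0 + h * h - u(h) ** 2

def mean_abs_numeric(h, half_width=12.0, steps=100_000):
    """Midpoint-rule approximation of E|z + h| on [-half_width, half_width]."""
    dz = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * dz
        total += abs(z + h) * math.exp(-0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)

checks = [(h, u(h), mean_abs_numeric(h), sigma2(h)) for h in (0.0, 0.5, 2.0, 5.0)]
for h, closed, numeric, var in checks:
    print(h, closed, numeric, var)
```

At $h = 0$ the variance equals $1 - 2/\pi$, and it increases toward 1 as $h$ grows, in line with (C.35).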

We now show (C.36). Write for short $Y_i = |z_i + h|$. It is sufficient to show
that

(C.37) $P(\sum_{i=1}^n Y_i \ge nu(h) + \sqrt{n}x) \le \exp(-(1 + o(1))\frac{x^2}{2\sigma^2(h)})$,

and

(C.38) $P(\sum_{i=1}^n Y_i \le nu(h) - \sqrt{n}x) \le \exp(-(1 + o(1))\frac{x^2}{2\sigma^2(h)})$.


Since the proofs are similar, we only show (C.37). By elementary calculations, the moment generating function of $Y_i$ is

(C.39) $M_Y(s) = E[e^{sY}] = e^{s^2/2}[e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h)]$.

By the Cramér-Chernoff Theorem ([10]), for any $s > 0$ and any $y$,

(C.40) $P(\sum_{i=1}^n Y_i \ge ny) \le e^{-n(ys - \log M_Y(s))}$.

We now show (C.37) for the cases of $h \le 2\log(\sqrt{n}/x)$ and $h > 2\log(\sqrt{n}/x)$
separately.
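
As a quick check of the moment generating function (C.39), the following sketch compares the closed form with direct numerical integration of $E[e^{s|z + h|}]$. The function names and the test points $(s, h)$ are illustrative assumptions.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mgf_closed_form(s, h):
    """M_Y(s) for Y = |z + h|, z ~ N(0, 1), as stated in (C.39)."""
    return math.exp(0.5 * s * s) * (
        math.exp(h * s) * Phi(s + h) + math.exp(-h * s) * Phi(s - h)
    )

def mgf_numeric(s, h, half_width=14.0, steps=100_000):
    """Midpoint-rule approximation of E[exp(s |z + h|)]."""
    dz = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = -half_width + (i + 0.5) * dz
        total += math.exp(s * abs(z + h) - 0.5 * z * z)
    return total * dz / math.sqrt(2.0 * math.pi)

points = [(0.1, 0.5), (0.5, 2.0), (1.0, 0.0)]
rel_errors = [abs(mgf_closed_form(s, h) / mgf_numeric(s, h) - 1.0) for s, h in points]
print(rel_errors)
```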

Consider the case where $h \le 2\log(\sqrt{n}/x)$. We wish to use (C.40) with

$s = \frac{1}{\sigma^2(h)}\frac{x}{\sqrt{n}}, \qquad y = u(h) + x/\sqrt{n}$.

By our assumptions of $h \le 2\log(\sqrt{n}/x)$ and $0 < x \le \sqrt{n}/\log(n)$,

$s = O(x/\sqrt{n}) = o(1), \qquad hs \le 2\log(\sqrt{n}/x)\,(x/\sqrt{n}) \cdot O(1) = o(1)$.

Now, on one hand, since $y\log^3(1/y) \to 0$ as $y \to 0+$, $h^3 s = o(1)$ and
$h^3 s^3 = o(s^2)$. Combining this with elementary Taylor expansion,

(C.41) $e^{\pm hs} = 1 \pm hs + \frac{h^2 s^2}{2} + o(s^2)$.
On the other hand, applying Taylor expansion to $\Phi(s \pm h)$ and noting that
$\phi$ is a symmetric function,

(C.42) $\Phi(s \pm h) = \Phi(\pm h) + \phi(h)s \mp \frac{1}{2}h\phi(h)s^2 + o(s^2)$,

where we have used that the third derivative of $\Phi$ is a bounded function.
Combining (C.41)-(C.42) and re-arranging,

(C.43) $e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h) = 1 + 2s\phi(h) + hs[\Phi(h) - \Phi(-h)] + h^2 s^2/2 + o(s^2)$

(C.44) $= 1 + u(h)s + h^2 s^2/2 + o(s^2)$,

where in the first step, we have used $\Phi(h) + \Phi(-h) = 1$, and in the second
step, we have used the expression of $u(h)$ given in (C.34).

We now analyze $\log[e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h)]$. Write for short $w = e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h) - 1$. By (C.44) and $|u(h)| \le h + 1$ from (C.34),
$|w| \le C\max\{(h + 1)s, h^2 s^2\}$, and so

$|\log(1 + w) - w + w^2/2| \le C|w|^3 \le C\max\{(h + 1)^3 s^3, h^6 s^6\}$,

where by a similar argument as above, $\max\{(h + 1)^3 s^3, h^6 s^6\} = o(s^2)$. Combining this with (C.44),

$\log[e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h)]$

$= \log(1 + w)$

$= w - w^2/2 + o(s^2)$

$= u(h)s + h^2 s^2/2 - [u(h)s + h^2 s^2/2]^2/2 + o(s^2)$

$= u(h)s + (h^2 - u(h)^2)s^2/2 - [u(h)h^2 s^3 + h^4 s^4/4]/2 + o(s^2)$,

where we note $|u(h)h^2 s^3 + h^4 s^4/4| \le C(h + 1)h^2 s^3 + h^4 s^4/4 = o(s^2)$. As a
result,

$\log[e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h)] = u(h)s + (h^2 - u(h)^2)s^2/2 + o(s^2)$.

Combining this with (C.39) and the expression of $\sigma^2(h)$ given in (C.34) and
rearranging it,

$ys - \log[M_Y(s)] = (y - u(h))s - (1 + h^2 - u(h)^2)\frac{s^2}{2} + o(s^2) = (y - u(h))s - \sigma^2(h)\frac{s^2}{2} + o(s^2)$.

Now, invoking $s = x/(\sigma^2(h)\sqrt{n})$ and $y = u(h) + x/\sqrt{n}$ gives

$ys - \log[M_Y(s)] = \frac{1}{2\sigma^2(h)}\,\frac{x^2}{n}\,(1 + o(1))$.

Combining this with (C.40) gives the claim.
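
To illustrate the first case numerically, the sketch below evaluates the Chernoff exponent $n(ys - \log M_Y(s))$ at $s = x/(\sigma^2(h)\sqrt{n})$ and $y = u(h) + x/\sqrt{n}$, and compares it with the target $x^2/(2\sigma^2(h))$. The specific values of $n$, $x$ and $h$ are assumptions chosen so that $h \le 2\log(\sqrt{n}/x)$ holds.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def u(h):
    """Mean of |z + h|, as in (C.34)."""
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h) + h * (1.0 - 2.0 * Phi(-h))

def sigma2(h):
    """Variance of |z + h|, as in (C.34)."""
    return 1.0 + h * h - u(h) ** 2

def log_mgf(s, h):
    """log M_Y(s), using the closed form (C.39)."""
    return 0.5 * s * s + math.log(
        math.exp(h * s) * Phi(s + h) + math.exp(-h * s) * Phi(s - h)
    )

def chernoff_exponent(n, x, h):
    """n * (y s - log M_Y(s)) at the choice of (s, y) used in the first case."""
    s = x / (sigma2(h) * math.sqrt(n))
    y = u(h) + x / math.sqrt(n)
    return n * (y * s - log_mgf(s, h))

n, x, h = 10 ** 6, 2.0, 1.0
assert h <= 2.0 * math.log(math.sqrt(n) / x)   # we are in the first case
ratio = chernoff_exponent(n, x, h) / (x * x / (2.0 * sigma2(h)))
print(ratio)
```

The ratio should be close to 1, which is the $(1 + o(1))$ factor in (C.37).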

We now consider the case of $h > 2\log(\sqrt{n}/x)$. We wish to use (C.40)
again, with the same $y$ but a different $s$: $s = x/\sqrt{n}$. In the current case,
since $x \le \sqrt{n}/\log(n)$,

$h \to \infty, \qquad s \to 0$.

By the assumptions of $h > 2\log(\sqrt{n}/x)$ and $s = x/\sqrt{n}$, and

$\phi(h/2) \le C\exp(-(\log(\sqrt{n}/x))^2/2) = o(s^2)$,

it follows that $\max\{\Phi(-s - h), \Phi(s - h)\} \le \Phi(-h/2) = o(1)\phi(h/2)$, where
the right hand side is $o(s^2)$. As a result,

(C.45) $e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h) = e^{hs}[1 - \Phi(-s - h) + e^{-2hs}\Phi(s - h)] = e^{hs}[1 + o(s^2)]$,

and so

$\log[e^{hs}\Phi(s + h) + e^{-hs}\Phi(s - h)] = hs + o(s^2)$.

Combining this with (C.39) and (C.47) and invoking $s = x/\sqrt{n}$ and $y = u(h) + x/\sqrt{n}$,

$ys - \log M_Y(s) = (u(h) + x/\sqrt{n} - h)s - s^2/2 + o(s^2)$

$= s^2/2 + o(s^2)$

(C.46) $= s^2/(2\sigma^2(h)) + o(s^2)$,

where in the last two steps, we have used

(C.47) $h - u(h) = 2h\Phi(-h) - 2\phi(h) = o(s), \qquad \sigma^2(h) = 1 + h^2 - u(h)^2 = 1 + o(s)$.

Inserting (C.46) into (C.40) gives the claim. □

C.3. Proof of Lemma 3.3. Denote by $\Phi$ the CDF of $N(0, 1)$ and by $\phi$ its density. By direct
calculations,

$u(h) = \sqrt{2/\pi}\,e^{-h^2/2} + h[1 - 2\Phi(-h)]$.

This implies $u(h) \to \sqrt{2/\pi}$ when $h \to 0$ and $u(h)/h \to 1$ when $h \to \infty$.
Furthermore,

$u'(h) = -2h\phi(h) + [1 - 2\Phi(-h)] + 2h\phi(-h) = 1 - 2\Phi(-h)$,

$u''(h) = 2\phi(-h) > 0$.

So $u(h)$ is strictly convex and monotonically increasing for $h \in (0, \infty)$.
Let $h_0$ be the unique solution of $u'(h) = 0.9$. Fix $(h_1, h_2)$ such that $h_2 > h_1 > 0$. If $h_1 \ge h_0$, by convexity,

$u(h_2) - u(h_1) \ge u'(h_1)(h_2 - h_1) \ge 0.9(h_2 - h_1)$.

If $h_2 \le h_0$, using the Taylor expansion, for some $\tilde h \in [h_1, h_2]$,

$u(h_2) - u(h_1) = u'(h_1)(h_2 - h_1) + \frac{1}{2}u''(\tilde h)(h_2 - h_1)^2 \ge \frac{1}{2}u''(h_0)(h_2 - h_1)^2$.

If $h_1 < h_0 < h_2$, then we decompose the difference into $u(h_2) - u(h_0) + u(h_0) - u(h_1)$ and combine the two cases we just discussed, so that

$u(h_2) - u(h_1) \ge 0.9(h_2 - h_0) + C_1(h_0 - h_1)^2$.

When $h_2 - h_0 \ge h_0 - h_1$, we have $u(h_2) - u(h_1) \ge 0.45(h_2 - h_0) + 0.45(h_0 - h_1) = 0.45(h_2 - h_1)$; otherwise, $u(h_2) - u(h_1) \ge C_2[(h_2 - h_0)^2 + (h_0 - h_1)^2] \ge C(h_2 - h_1)^2$. Combining the three cases gives the claim.

□ 
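
The derivative formulas used in this proof can be verified with finite differences, and $h_0$ can be located by bisection. This is a minimal sketch; the helper names, step sizes and test points are assumptions made for illustration.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def u(h):
    return math.sqrt(2.0 / math.pi) * math.exp(-0.5 * h * h) + h * (1.0 - 2.0 * Phi(-h))

def u_prime(h):
    """Claimed first derivative: 1 - 2 Phi(-h)."""
    return 1.0 - 2.0 * Phi(-h)

def u_second(h):
    """Claimed second derivative: 2 phi(h)."""
    return 2.0 * phi(h)

def solve_h0(target=0.9, lo=0.0, hi=10.0, iters=100):
    """Bisection for u'(h) = target; u' is increasing on (0, infinity)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if u_prime(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

h0 = solve_h0()
fd_checks = []
for h in (0.3, 1.0, 2.5):
    fd1 = (u(h + 1e-5) - u(h - 1e-5)) / 2e-5                  # central difference
    fd2 = (u(h + 1e-3) - 2.0 * u(h) + u(h - 1e-3)) / 1e-6     # second difference
    fd_checks.append((abs(fd1 - u_prime(h)), abs(fd2 - u_second(h))))
print(h0, fd_checks)
```

Since $u'(h) = 0.9$ is equivalent to $\Phi(-h) = 0.05$, the value of $h_0$ is the 95% standard normal quantile, about 1.645.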


APPENDIX D: PROOF OF SECONDARY LEMMAS 
In this section, we show the proof of Lemmas B.3, B.5 and C.l. 

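D.1. Proof of Lemma B.3. The following lemma is useful; it is
proved below.

Lemma D.1 For any fixed $q > 0$,

$\pi_0 = \bar\Phi(\sqrt{2q\log(p)})(1 + L_p n^{-1/2})$,

$\pi_1 = \begin{cases} 1 - L_p p^{-(\sqrt{r} - \sqrt{q})^2}, & r > q, \\ \bar\Phi((\sqrt{q} - \sqrt{r})\sqrt{2\log(p)})(1 + L_p n^{-1/4}), & r \le q, \end{cases}$

where $\bar\Phi = 1 - \Phi$ is the survival function of $N(0, 1)$.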

First, we prove (a). Write for short $z_j = z$ and $z_j^{(q)} = z^{(q)}$. Since the
distribution of $z^{(q)}$ is spherically symmetric, $v'z^{(q)}$ has the same distribution
as $e_1'z^{(q)}$, for any $v \in S^{n-1}$. It follows that $E[(v'z^{(q)})^2] = E[(z^{(q)}(1))^2] = a_p$.
Furthermore, $E(\|z^{(q)}\|^2) = nE[(z^{(q)}(1))^2] = na_p$.

Consider $E(|v'z^{(q)}|^m)$. Again, by spherical symmetry,

$E(|v'z^{(q)}|^m) = E(|z^{(q)}(1)|^m) = E(|z(1)|^m 1\{z^2(1) + \|\tilde z\|^2 > n + 2\sqrt{qn\log(p)}\})$,

where $\tilde z = (z(2), \cdots, z(n))'$. Note that $\tilde z$ is independent of $z(1)$ and $\|\tilde z\|^2 \sim \chi^2_{n-1}$. Let $B_1$ be the event that $|z(1)| \le \sqrt{2\delta_1\log(p)}$, for some $\delta_1$ to be determined. From basic properties of the $N(0, 1)$ distribution, $P(B_1^c) = L_p p^{-\delta_1}$
and $E(|z(1)|^m I_{B_1^c}) = L_p p^{-\delta_1}$. It follows that

$E(|v'z^{(q)}|^m) \le E(|z(1)|^m 1\{z^2(1) + \|\tilde z\|^2 > n + 2\sqrt{qn\log(p)}, B_1\}) + L_p p^{-\delta_1}$

$\le E(|z(1)|^m 1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - 2\delta_1\log(p)\}) + L_p p^{-\delta_1}$

$= E(|z(1)|^m) \cdot P(\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)}(1 + o(1))) + L_p p^{-\delta_1}$

$= E(|z(1)|^m)\pi_0(1 + o(1)) + L_p p^{-\delta_1}$.

By choosing $\delta_1$ appropriately large, we find that the first term dominates.
Consider $E(\|z^{(q)}\|^{2m})$. Denote by $f_n$ the density of $\chi^2_n$, where $f_n(y) = \frac{y^{n/2-1}e^{-y/2}}{2^{n/2}\Gamma(n/2)}$. Note that $y^m f_n(y) = \frac{2^m\Gamma(n/2 + m)}{\Gamma(n/2)}f_{n+2m}(y)$. It follows that

$E(\|z^{(q)}\|^{2m}) = \frac{2^m\Gamma(n/2 + m)}{\Gamma(n/2)}\,P(\chi^2_{n+2m} > n + 2\sqrt{qn\log(p)})$.

First, dropping the threshold on both sides, we have $\kappa_{2m}(n) \equiv E(\|z\|^{2m}) = \frac{2^m\Gamma(n/2 + m)}{\Gamma(n/2)}$. Second, since $n + 2\sqrt{qn\log(p)} = n^* + 2\sqrt{qn^*\log(p)}(1 + L_p n^{-1/2})$
for $n^* = n + 2m$, Lemma D.1 implies that $P(\chi^2_{n+2m} > n + 2\sqrt{qn\log(p)}) = \pi_0(1 + o(1))$. Together, the above right hand side is $\kappa_{2m}(n)\pi_0(1 + o(1))$.

Consider $a_p$. Similarly to the above, for $n^* = n + 2$,

$a_p = n^{-1}E(\|z^{(q)}\|^2) = \frac{2\Gamma(n/2 + 1)}{n\Gamma(n/2)}\,P(\chi^2_{n+2} > n^* + 2\sqrt{qn^*\log(p)}(1 + L_p n^{-1/2}))$

$= P(\chi^2_{n+2} > n^* + 2\sqrt{qn^*\log(p)}(1 + L_p n^{-1/2}))$

$= \pi_0(1 + L_p n^{-1/2})$.
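
The density identity $y^m f_n(y) = \kappa_{2m}(n) f_{n+2m}(y)$ and the moment formula for $\kappa_{2m}(n)$ can be spot-checked as follows. The function names and the test values of $(n, m, y)$ are assumptions for illustration.

```python
import math

def chi2_pdf(y, n):
    """Density of the chi-square distribution with n degrees of freedom."""
    log_f = ((0.5 * n - 1.0) * math.log(y) - 0.5 * y
             - 0.5 * n * math.log(2.0) - math.lgamma(0.5 * n))
    return math.exp(log_f)

def kappa(m, n):
    """m-th moment of chi-square_n: 2^m Gamma(n/2 + m) / Gamma(n/2)."""
    return math.exp(m * math.log(2.0) + math.lgamma(0.5 * n + m) - math.lgamma(0.5 * n))

n, m, y = 50, 3, 60.0
lhs = y ** m * chi2_pdf(y, n)
rhs = kappa(m, n) * chi2_pdf(y, n + 2 * m)
print(lhs, rhs, kappa(1, n), kappa(2, n))
```

In particular $\kappa_2(n) = n$, which is why the prefactor in the display for $a_p$ equals 1, and $\kappa_4(n) = n^2 + 2n$.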

Second, we prove (b). We first state an approximation of $\pi_1$. From basic
properties of chi-square distributions, for all $q, r > 0$,

$P(\chi^2_n(0) > n + 2\sqrt{qn\log(p)}) = \bar\Phi(\sqrt{2q\log(p)})(1 + L_p n^{-1/2})$,

$P(\chi^2_n(2\sqrt{rn\log(p)}) > n + 2\sqrt{qn\log(p)}) = \bar\Phi((\sqrt{q} - \sqrt{r})\sqrt{2\log(p)})(1 + L_p n^{-1/4})$.

Therefore, we find that

$\pi_1 = P(\chi^2_n(2\sqrt{rn\log(p)}) > n + 2\sqrt{qn\log(p)})$

(D.48) $= P(\chi^2_n(0) > n + 2(\sqrt{q} - \sqrt{r})\sqrt{n\log(p)}) \cdot (1 + o(1))$.

Consider $E[(v'z^{(q)})^2]$. Fix $v$ and introduce

$w_1 = \ell/\|\ell\|, \qquad w_2 = (1 - (v'\ell)^2/\|\ell\|^2)^{-1/2}[v - (v'\ell)\ell/\|\ell\|^2]$.

Both $w_1$ and $w_2$ are unit vectors and $w_1'w_2 = 0$. Let $Q$ be any orthogonal
matrix whose first two columns are $w_1$ and $w_2$. By direct calculations, $Q'v = (x_0, \sqrt{1 - x_0^2}, 0, \cdots, 0)'$ and $Q'\ell = (\sqrt{n}, 0, \cdots, 0)'$, where $x_0 = (v'\ell)/\|\ell\|$.
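
The rotation used here can be verified on a small example. The sketch below (with an arbitrary label vector and an arbitrary unit vector $v$, both assumptions for illustration) builds $w_1$ and $w_2$ and checks the first two coordinates of $Q'v$ and $Q'\ell$.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

ell = [1.0, -1.0, 1.0, 1.0, -1.0]          # labels in {-1, 1}, so ||ell|| = sqrt(n)
raw = [3.0, 1.0, -2.0, 0.5, 4.0]
norm = math.sqrt(dot(raw, raw))
v = [x / norm for x in raw]                # an arbitrary unit vector

n = len(ell)
w1 = [x / math.sqrt(n) for x in ell]       # w1 = ell / ||ell||
x0 = dot(v, w1)                            # x0 = v'ell / ||ell||
scale = math.sqrt(1.0 - x0 * x0)
w2 = [(v[i] - x0 * w1[i]) / scale for i in range(n)]

coords_v = (dot(w1, v), dot(w2, v))        # first two entries of Q'v
coords_ell = (dot(w1, ell), dot(w2, ell))  # first two entries of Q'ell
# v lies in span{w1, w2}, so the remaining entries of Q'v vanish.
remainder = [v[i] - coords_v[0] * w1[i] - coords_v[1] * w2[i] for i in range(n)]
print(coords_v, coords_ell)
```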

Since $Q'z$ and $z$ have the same distribution,

$v'z^{(q)} = v'QQ'z \cdot 1\{\|Q'z + \mu(j)Q'\ell\|^2 > n + 2\sqrt{qn\log(p)}\}$

$\overset{d}{=} v'Qz \cdot 1\{\|z + \mu(j)Q'\ell\|^2 > n + 2\sqrt{qn\log(p)}\}$

(D.49) $= [x_0 z(1) + (1 - x_0^2)^{1/2}z(2)] \cdot 1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}$.

It follows that

$E[(v'z^{(q)})^2] = E[(x_0 z(1) + (1 - x_0^2)^{1/2}z(2))^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}]$

$= (1 - x_0^2)E[(z(2))^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}] + x_0^2 E[(z(1))^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}]$

$= b_p + (c_p - b_p)(v'\ell)^2/\|\ell\|^2$,

where the second equality comes from the symmetry in $z(2)$ (so the cross
term disappears).

Consider $b_p$ and $c_p$. Let $\tilde z = (z(2), \cdots, z(n))'$, where $\|\tilde z\|^2 \sim \chi^2_{n-1}$ and it
is independent of $z(1)$. We write

$c_p = E[(z(1))^2\,1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\}], \qquad g(x) = (x + \sqrt{n}\tau_p^*)^2$.

For a constant $\delta_2 > 0$ to be determined, let $B_2$ be the event that $|z(1)| \le \sqrt{2\delta_2\log(p)}$. From basic properties of normal distributions, $P(B_2^c) = L_p p^{-\delta_2}$
and $E[z^2(1)I_{B_2^c}] = L_p p^{-\delta_2}$. Over the event $B_2$, we have $g(z(1)) = [z(1) + (2\sqrt{nr\log(p)})^{1/2}]^2 = 2\sqrt{rn\log(p)}(1 + L_p n^{-1/4})$. It follows that

$c_p \le E[(z(1))^2 \cdot P(B_2 \cap \{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\}\,|\,z(1))] + L_p p^{-\delta_2}$

$\le E[(z(1))^2] \cdot P(\chi^2_{n-1} > n + 2(\sqrt{q} - \sqrt{r})\sqrt{n\log(p)}(1 + L_p n^{-1/4})) + L_p p^{-\delta_2}$

$= \pi_1(1 + L_p n^{-1/4})$,

where the last equality comes from (D.48) and that $\delta_2$ is chosen appropriately large. To compute $b_p$, we write

$b_p = E[(z(2))^2\,1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\}]$

$= (n - 1)^{-1}E[\|\tilde z\|^2\,1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\}]$.

Let $B_2$ be the same event. Let $q^* = [(\sqrt{q} - \sqrt{r})_+]^2$. We have

$b_p = (n - 1)^{-1}E[\|\tilde z\|^2\,1\{\|\tilde z\|^2 > n + 2\sqrt{q^* n\log(p)}(1 + L_p n^{-1/4})\}] + L_p p^{-\delta_2}$

$= (n - 1)^{-1}E(\|\tilde z^{(q^*)}\|^2)(1 + L_p n^{-1/4}) = \pi_1(1 + L_p n^{-1/4})$,

where in the last equality, we have applied the result in (a) with $q = q^*$.

Consider $E(|v'z^{(q)}|^m)$. Let $w = (z(3), \cdots, z(n))'$. Then $\|w\|^2 \sim \chi^2_{n-2}$ and
it is independent of $(z(1), z(2))$. By (D.49),

$E(|v'z^{(q)}|^m) = E(|x_0 z(1) + (1 - x_0^2)^{1/2}z(2)|^m \cdot 1\{\|w\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1)) - (z(2))^2\})$.

Let $B_3$ be the event that $\max\{|z(1)|, |z(2)|\} \le \sqrt{2\delta_3\log(p)}$. Then $P(B_3^c) = L_p p^{-\delta_3}$ and over $B_3$, $g(z(1)) + (z(2))^2 = 2\sqrt{rn\log(p)}(1 + o(1))$. Applying
similar arguments as above, we find that

$E(|v'z^{(q)}|^m) \le E(|x_0 z(1) + (1 - x_0^2)^{1/2}z(2)|^m) \cdot \pi_1(1 + o(1)) = \kappa_m\pi_1(1 + o(1))$.

Here the last equality is because $x_0 z(1) + (1 - x_0^2)^{1/2}z(2) \sim N(0, 1)$. The
claim then follows.

Consider $E(\|z^{(q)}\|^2)$ and $E(\|z^{(q)}\|^{2m})$. Using $Q$ defined above (for an arbitrary $v$),

$E(\|z^{(q)}\|^2) = E(\|Q'z\|^2\,1\{\|Q'z + \tau_p^* Q'\ell\|^2 > n + 2\sqrt{qn\log(p)}\})$

$= E(\|z\|^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\})$

$= E((z(1))^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}) + (n - 1)E((z(2))^2\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\})$

$= c_p + (n - 1)b_p$.

Recall that $\tilde z = (z(2), \cdots, z(n))'$, $q^* = [(\sqrt{q} - \sqrt{r})_+]^2$ and $g(x) = (x + \sqrt{n}\tau_p^*)^2$
for any $x \in R$. Note that $(x + y)^m \le 2^m(|x|^m + |y|^m)$ for any $x, y \in R$. We
have

(D.50) $E(\|z^{(q)}\|^{2m}) = E(\|z\|^{2m}\,1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\})$

$\le 2^m E((z(1))^{2m}\,1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\}) + 2^m E(\|\tilde z\|^{2m}\,1\{\|\tilde z\|^2 > n + 2\sqrt{qn\log(p)} - g(z(1))\})$

$= 2^m\kappa_{2m}\pi_1(1 + o(1)) + 2^m E(\|\tilde z^{(q^*)}\|^{2m})(1 + o(1))$

$= 2^m(\kappa_{2m} + \kappa_{2m}(n - 1)) \cdot \pi_1(1 + o(1))$.

Here, we have applied the result in (a) for $E(\|\tilde z^{(q^*)}\|^{2m})$ with $q = q^*$.

Last, we prove (c). Using the spherical symmetry of $z$ and the $Q$ defined
above, we have already seen that $\|z + \tau_p^*\ell\| \overset{d}{=} \|z + \sqrt{n}\tau_p^* e_1\|$ and

$z \cdot 1\{\|z + \tau_p^*\ell\|^2 > n + 2\sqrt{qn\log(p)}\} \overset{d}{=} Qz \cdot 1\{\|z + \sqrt{n}\tau_p^* e_1\|^2 > n + 2\sqrt{qn\log(p)}\}$.

Then the claims follow from the definitions and (a)-(b). □

D.2. Proof of Lemma B.5. Let $\pi(j) = \pi_0$ for $j \notin S(\mu)$ and $\pi(j) = \pi_1$
for $j \in S(\mu)$, where $\pi_0, \pi_1$ are defined in Section B.1. Then $\sum_{j=1}^p \pi(j) = m^{(q)}$
by (c) of Lemma B.3.

First, consider $\mathrm{tr}(H_0)$. Write $M_j = n^{-1}[\|z_j^{(q)}\|^2 - E(\|z_j^{(q)}\|^2)]$. By definition,

(D.51) $n^{-1}\mathrm{tr}(H_0) - m_*^{(q)} = \sum_{j=1}^p M_j$.

By Lemma B.3, $E(\|z_j^{(q)}\|^2) \le Cn\pi(j)$ and $E(\|z_j^{(q)}\|^{2m}) \le 2^m\kappa_{2m}(n)\pi(j) \le C4^m\pi(j)n^m$, where $\kappa_{2m}(n)$ is the $m$-th moment of the $\chi^2_n$ distribution and
we have used $\kappa_{2m}(n) \le C2^m n^m$. Noting that $(a + b)^m \le 2^m(a^m + b^m)$ for
any nonnegative real values $a$ and $b$, by direct calculations,

$E(M_j) = 0, \qquad \mathrm{var}(M_j) \le C\pi(j), \qquad E(|M_j|^m) \le C8^m$.

By Lemma B.1 (Bernstein inequality), with probability at least $1 - O(p^{-3})$,

(D.52) $|\sum_{j=1}^p M_j| \le C\sqrt{m^{(q)}\log(p)}$.

Combining (D.51)-(D.52) gives the first claim.

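Second, consider $\|H_0\|_F^2$. By direct calculations,

(D.53) $\|H_0\|_F^2 - n^{-1}[\mathrm{tr}(H_0)]^2 = \sum_{1 \le j,k \le p}[(z_j^{(q)})'z_k^{(q)}]^2 - n^{-1}(\sum_{j=1}^p \|z_j^{(q)}\|^2)^2$

$= \frac{n - 1}{n}\sum_{j=1}^p \|z_j^{(q)}\|^4 + 2\sum_{1 \le j < k \le p}\{[(z_j^{(q)})'z_k^{(q)}]^2 - \frac{1}{n}\|z_j^{(q)}\|^2\|z_k^{(q)}\|^2\}$

$= (I) + (II)$.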
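
The decomposition (D.53) is a purely algebraic identity for $H_0 = \sum_j z_j z_j'$, so it can be checked on any small set of columns. In the sketch below the columns are random Gaussian vectors with one column zeroed out to mimic a column removed by the selection step; the dimensions and the seed are arbitrary assumptions.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frobenius_gap(cols):
    """Return ||H||_F^2 - tr(H)^2 / n for H = sum_j z_j z_j'."""
    n = len(cols[0])
    H = [[sum(z[i] * z[k] for z in cols) for k in range(n)] for i in range(n)]
    frob2 = sum(H[i][k] ** 2 for i in range(n) for k in range(n))
    trace = sum(H[i][i] for i in range(n))
    return frob2 - trace ** 2 / n

def term_I_plus_II(cols):
    """Right hand side of (D.53): diagonal term (I) plus cross term (II)."""
    n, p = len(cols[0]), len(cols)
    sq = [dot(z, z) for z in cols]
    term_I = (n - 1) / n * sum(s * s for s in sq)
    term_II = 2.0 * sum(dot(cols[j], cols[k]) ** 2 - sq[j] * sq[k] / n
                        for j in range(p) for k in range(j + 1, p))
    return term_I + term_II

random.seed(0)
n, p = 4, 6
cols = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
cols[2] = [0.0] * n                      # a column dropped by the selection step
lhs, rhs = frobenius_gap(cols), term_I_plus_II(cols)
print(lhs, rhs)
```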

We now study (I). Write $U_j = n^{-1}\|z_j^{(q)}\|^4$ for short. By Lemma B.3, $E(U_j) = n^{-1}\kappa_4(n)\pi_0(1 + o(1))$ and $\mathrm{var}(U_j) \le Cn^2\pi_0$ for $j \notin S(\mu)$; moreover, $\mathrm{var}(U_j) \le Cn^2\pi_1$ for $j \in S(\mu)$. We also claim that $E(U_j) \ge n^{-1}\kappa_4(n - 1)\pi_1(1 + o(1))$
for $j \in S(\mu)$. The proof is similar to that for (D.50), but in the second line
of (D.50), we instead use the inequality $\|z\|^{2m} \ge \|\tilde z\|^{2m}$. Note that $\kappa_4(n)$ is
the second moment of $\chi^2_n$ and so $\kappa_4(n) = n^2 + 2n$. It follows that

$\sum_{j=1}^p E(U_j) \ge nm^{(q)}(1 + o(1)), \qquad \sum_{j=1}^p \mathrm{var}(U_j) \le Cn^2 m^{(q)}$.

Using Lemma B.1, with probability at least $1 - O(p^{-3})$, $\sum_{j=1}^p U_j \ge nm^{(q)}(1 + o(1)) - Cn\sqrt{m^{(q)}\log(p)} \ge Cnm^{(q)}$. Since $(I) = (n - 1)\sum_{j=1}^p U_j$,

(D.54) $(I) \ge C_1 n^2 m^{(q)}$, for some constant $C_1 > 0$.

We then study (II). Let $V_{jk} = [(z_j^{(q)})'z_k^{(q)}]^2 - \frac{1}{n}\|z_j^{(q)}\|^2\|z_k^{(q)}\|^2$. Introduce

$W_j(v) = [v'z_j^{(q)}]^2 - n^{-1}\|z_j^{(q)}\|^2$, for any $v \in S^{n-1}$.

Let $v_j = z_j/\|z_j\|$. Then $v_j$ is independent of $\|z_j^{(q)}\|$ and $V_{jk} = \|z_j^{(q)}\|^2 W_k(v_j)$.
By Lemma B.3, for any fixed $v \in S^{n-1}$,

$E[W_j(v)] = 0, \qquad E[(W_j(v))^2] \le C\pi_0, \qquad j \notin S(\mu)$,

$E[W_j(v)] = (c_p - b_p)[(v'\tilde\ell)^2 - n^{-1}], \qquad E[(W_j(v))^2] \le C\pi_1, \qquad j \in S(\mu)$,

where $\tilde\ell = \ell/\|\ell\|$. As a result, if either $j \notin S(\mu)$ or $k \notin S(\mu)$, then $E(V_{jk}) = 0$;
if both $j, k \in S(\mu)$, then $E(V_{jk}) = (c_p - b_p)E\{\|z_j^{(q)}\|^2[(v_j'\tilde\ell)^2 - n^{-1}]\} = (c_p - b_p)E[W_j(\tilde\ell)] = (1 - n^{-1})(c_p - b_p)^2 \ge 0$. It follows that

(D.55) $E[(II)] \ge 0$.

To compute $\mathrm{var}((II))$, we calculate $E(V_{jk}V_{j'k'})$ for all $(j, k, j', k')$ such that
$j \ne k$ and $j' \ne k'$. Since $V_{jk} = V_{kj}$, we assume $j \ne k'$ and $j' \ne k$ without loss of generality. We have the following observations: (1) $E(V_{jk}) \le (c_p - b_p)^2$ if both $j, k \in S(\mu)$ and $E(V_{jk}) = 0$ otherwise. (2) $E(V_{jk}^2) = E(\|z_j^{(q)}\|^4)E[W_k^2(v_j)] \le Cn^2\pi(j)\pi(k)$ for any $j \ne k$. (3) When $j \ne j'$ and
$k \ne k'$, $V_{jk}$ is independent of $V_{j'k'}$, so $E(V_{jk}V_{j'k'}) = E(V_{jk})E(V_{j'k'})$. (4)
When $j = j'$ and $k \ne k'$, $E(V_{jk}V_{jk'}) = E[\|z_j^{(q)}\|^4 W_k(v_j)W_{k'}(v_j)]$; as a result, $E(V_{jk}V_{jk'}) = 0$ when either $k \notin S(\mu)$ or $k' \notin S(\mu)$; if $k, k' \in S(\mu)$,
$E(V_{jk}V_{jk'}) = (c_p - b_p)^2 E[(W_j(\tilde\ell))^2] \le C(c_p - b_p)^2\pi(j)$. Therefore,

$\sum_{(j,j',k,k'):\, j \ne k,\, j' \ne k'} E(V_{jk}V_{j'k'}) \le \sum_{(j,k):\, j \ne k} Cn^2\pi(j)\pi(k) + \sum_{(j,k,k'):\, j \notin \{k,k'\},\, \{k,k'\} \subset S(\mu),\, k \ne k'} C(c_p - b_p)^2\pi(j) + \sum_{(j,j',k,k'):\, \{j,j',k,k'\} \subset S(\mu),\ j,j',k,k' \text{ are different}} (c_p - b_p)^4$

$\le Cn^2(m^{(q)})^2 + C(c_p - b_p)^2 m^{(q)}|S(\mu)|^2 + C(c_p - b_p)^4|S(\mu)|^4$

$\le Cn^2(m^{(q)})^2 + m^{(q)}(|S(\mu)|\pi_1)^2 \cdot o(1) + C(|S(\mu)|\pi_1)^4 \cdot o(1)$,

where the last inequality is due to $c_p - b_p = o(\pi_1)$. Using (B.15), when
$r < \rho_\theta(\beta)$ (“impossibility”) and $q < q(\beta, \theta, r)$ (“fat” case), $(|S(\mu)|\pi_1)^2 = o(m^{(q)})$ and so the first term in the above dominates the other two. It implies

(D.56) $\mathrm{var}((II)) \le C_2 n^2(m^{(q)})^2$, for some constant $C_2 > 0$.

We combine (D.55)-(D.56) and apply the Markov inequality. It follows that
with probability at least $1 - 4n^{-2}C_2/C_1^2$,

(D.57) $(II) \ge -C_1 n^2 m^{(q)}/2$.

The second claim follows by plugging (D.54) and (D.57) into (D.53). □

D.3. Proof of Lemma C.1. The proof is similar to that of Lemma 2.1.
By Lemma B.4, there exists a $(1/4)$-net of $S^{n-1}$, denoted as $\mathcal{M}^*_{1/4}$, such
that $|\mathcal{M}^*_{1/4}| \le 9^n$ and $\sup_{v \in \mathcal{M}^*_{1/4}} |v'Av| \ge \|A\|/2$ for any symmetric $n \times n$ matrix $A$.
Therefore, to show the claim, it suffices to show that for each fixed $v \in \mathcal{M}^*_{1/4}$,
with probability $\ge 1 - O(9^{-n}p^{-2})$,

(D.58) $|v'(ZBB'Z' - \mathrm{tr}(BB')I_n)v| \le C\sqrt{np}$.

Denote the eigenvalue decomposition of $BB'$ by $V'\Lambda V$, where $\Lambda$ is a diagonal matrix with diagonals $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_p$. Fix $v$; we can write

$v'ZBB'Z'v = v'ZV'\Lambda VZ'v = \sum_{j=1}^p \lambda_j\eta_j^2, \qquad \eta_j \overset{iid}{\sim} N(0, 1)$.

The last equation comes from $VZ'v \sim N(0, I_p)$. So we have $E[v'ZBB'Z'v] = \mathrm{tr}(BB')$ for any fixed $v$ with $\|v\| = 1$. Let $W_j = \lambda_j\eta_j^2/\lambda_1 - \lambda_j/\lambda_1$; then the $W_j$'s
are independent of each other, $E(W_j) = 0$, $\mathrm{var}(W_j) \le 2$ and $E(|W_j|^m) \le \kappa_{2m}$. We apply Lemma B.1 with $\lambda = 2\sqrt{n\log(9) + 2\log(p)}$. To check the
moment conditions, we note that $\kappa_{2m} = E_{z \sim N(0,1)}(|z|^{2m}) \le 2^m m!$ for all
$m \ge 1$. It follows that with probability $\ge 1 - O(9^{-n}p^{-2})$,

$\lambda_1|\sum_{j=1}^p W_j| \le 2\sqrt{np\log(9) + 2p\log(p)}\,\lambda_1 \le C\sqrt{np}$.

The last inequality is because $\lambda_1 = \|B\|^2 \le L_p$. This proves (D.58). □


D.4. Proof of Lemma D.1. We start from computing $\pi_0$. Using the
density of the $\chi^2_n$ distribution,

$\pi_0 = \int_{n + 2\sqrt{qn\log(p)}}^\infty \frac{x^{n/2-1}e^{-x/2}}{2^{n/2}\Gamma(n/2)}\,dx \equiv \frac{1}{2^{n/2}\Gamma(n/2)} \cdot (I)$.

Now, we calculate the integral (I). Write for short

$t = \sqrt{2q\log(p)}$ and $x_0 = n + \sqrt{2n}\,t$.

With a variable change $x = x_0 + \sqrt{2n}\,y$, we have

$(I) = \sqrt{2n}\,x_0^{n/2-1}e^{-x_0/2}\int_0^\infty (1 + \sqrt{2n}\,y/x_0)^{n/2-1}e^{-\sqrt{2n}\,y/2}\,dy$

$= \sqrt{2n}\,x_0^{n/2-1}e^{-x_0/2}\int_0^\infty \exp\{(n/2 - 1)\log(1 + \sqrt{2n}\,y/x_0) - \sqrt{2n}\,y/2\}\,dy$

(D.59) $= \sqrt{2n}\,x_0^{n/2-1}e^{-x_0/2}\,[(I_1) + (I_2) + (I_3)]$,

where $(I_1)$ contains the integral from 0 to $ct$, $(I_2)$ contains that from $ct$ to
$x_0/\sqrt{2n}$ and $(I_3)$ contains that from $x_0/\sqrt{2n}$ to infinity. We will determine
the constant $c > 0$ later.

Consider $(I_1)$. From the Taylor expansion, $\log(1 + a) = a - a^2/2 + O(a^3)$ for
small $a$. Moreover, $n/x_0 = 1 - \sqrt{2n}\,t/x_0 + O(t^2/n)$, $x_0 = O(n)$ and $y = O(t)$.
As a result, for $0 \le y \le ct$, by simple calculations,

$(n/2 - 1)\log(1 + \sqrt{2n}\,y/x_0) - \sqrt{2n}\,y/2 = -ty - y^2/2 + O(t^3/\sqrt{n})$.

Noting that $e^a = 1 + O(a)$ for small $a$, $(I_1)$ is equal to $\int_0^{ct} e^{-ty - y^2/2}dy \cdot [1 + O(t^3/\sqrt{n})]$. By direct calculation, $\int_0^{ct} e^{-ty - y^2/2}dy = e^{t^2/2}\int_t^{(1+c)t} e^{-y^2/2}dy = \sqrt{2\pi}\,e^{t^2/2}[\bar\Phi(t) - \bar\Phi((1 + c)t)]$. By Mills' ratio, $\bar\Phi(t) = L_p p^{-q}$ and $\bar\Phi((1 + c)t) = L_p p^{-(1+c)^2 q}$. Therefore, when $c$ is chosen large enough, $\bar\Phi((1 + c)t) = o(1) \cdot \bar\Phi(t)L_p n^{-1/2}$. It follows that

(D.60) $(I_1) = \sqrt{2\pi}\,e^{t^2/2}\,\bar\Phi(t)\,(1 + L_p/\sqrt{n})$.
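
The closed form used for the leading integral, $\int_0^{ct} e^{-ty - y^2/2}dy = \sqrt{2\pi}\,e^{t^2/2}[\bar\Phi(t) - \bar\Phi((1 + c)t)]$, follows from completing the square and can be confirmed numerically. The helper names and the test values of $(t, c)$ below are assumptions for illustration.

```python
import math

def Phi_bar(x):
    """Survival function of N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def lhs_integral(t, c, steps=100_000):
    """Midpoint-rule approximation of the integral of exp(-t y - y^2/2) over [0, c t]."""
    upper = c * t
    dy = upper / steps
    total = 0.0
    for i in range(steps):
        y = (i + 0.5) * dy
        total += math.exp(-t * y - 0.5 * y * y)
    return total * dy

def rhs_closed_form(t, c):
    """sqrt(2 pi) exp(t^2/2) [Phi_bar(t) - Phi_bar((1 + c) t)]."""
    return math.sqrt(2.0 * math.pi) * math.exp(0.5 * t * t) * (
        Phi_bar(t) - Phi_bar((1.0 + c) * t)
    )

pairs = [(1.5, 2.0), (3.0, 1.0)]
rel_errors = [abs(lhs_integral(t, c) / rhs_closed_form(t, c) - 1.0) for t, c in pairs]
print(rel_errors)
```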

Consider $(I_2)$. Since $\log(1 + a) - a \le -a^2/4$ for $a \in [0, 1]$, when $ct \le y \le x_0/\sqrt{2n}$,

$(n/2 - 1)\log(1 + \sqrt{2n}\,y/x_0) - \sqrt{2n}\,y/2 \le -ty - y^2/4 + O(tx_0/(\sqrt{n})^3)$.

As a result, $(I_2) \le (1 + L_p n^{-1/2})\int_{ct}^\infty e^{-yt - y^2/4}dy$, where $\int_{ct}^\infty e^{-yt - y^2/4}dy = 2\sqrt{\pi}\,e^{t^2}\bar\Phi((c + 2)t/\sqrt{2}) = e^{t^2/2}L_p p^{-[(c+2)^2/2 - 1]q}$. By choosing $c$ appropriately large, we have

(D.61) $(I_2) = o(1) \cdot e^{t^2/2}L_p n^{-1/2}$.

Consider $(I_3)$. Since $\log(1 + a) \le a$ for all $a > 0$, when $y > x_0/\sqrt{2n}$,

$(n/2 - 1)\log(1 + \sqrt{2n}\,y/x_0) - \sqrt{2n}\,y/2 \le -(n/x_0)ty \le -ty/2$.

It follows that $(I_3) \le \int_{x_0/\sqrt{2n}}^\infty e^{-ty/2}dy = (2/t)e^{-tx_0/(2\sqrt{2n})} = o(1) \cdot e^{t^2/2}L_p n^{-1/2}$.
Combining the above results for $(I_1)$-$(I_3)$, we obtain that

$\pi_0 = R_n(t) \cdot \bar\Phi(t)(1 + L_p n^{-1/2})$, where $R_n(t) = \frac{2\sqrt{\pi n}\,x_0^{n/2-1}e^{-x_0/2 + t^2/2}}{2^{n/2}\Gamma(n/2)}$.

We plug in $x_0 = n + \sqrt{2n}\,t$ and rewrite $R_n(t) = \frac{2\sqrt{\pi n}\,n^{n/2-1}e^{-n/2}}{2^{n/2}\Gamma(n/2)} \cdot (1 + t\sqrt{2/n})^{n/2-1}e^{-t\sqrt{n/2} + t^2/2}$.
Note that by Taylor expansion, $\log(1 + a) = a - a^2/2 + O(a^3)$ for $a = t\sqrt{2/n}$.
Therefore, we have $R_n(t) = \frac{2\sqrt{\pi n}\,n^{n/2-1}e^{-n/2}}{2^{n/2}\Gamma(n/2)}\exp\{O(t^3/\sqrt{n})\} = 1 + L_p n^{-1/2}$, where the prefactor is $1 + O(1/n)$ by Stirling's formula.
This gives

$\pi_0 = \bar\Phi(t)(1 + L_p n^{-1/2})$.
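
The approximation $\pi_0 \approx \bar\Phi(t)$ can be illustrated numerically. For even $n$, the chi-square tail has the exact finite-sum form $P(\chi^2_n > x) = \sum_{j < n/2} e^{-x/2}(x/2)^j/j!$, which the sketch below uses to compare $P(\chi^2_n > n + \sqrt{2n}\,t)$ with $\bar\Phi(t)$. The choices of $n$ and $t$ are assumptions for illustration; the paper's statement concerns the asymptotic regime.

```python
import math

def chi2_sf_even(x, n):
    """Exact P(chi-square_n > x) for even n via the Poisson-sum identity."""
    if n % 2 != 0:
        raise ValueError("this helper only handles even degrees of freedom")
    lam = 0.5 * x
    log_lam = math.log(lam)
    return sum(math.exp(j * log_lam - lam - math.lgamma(j + 1.0))
               for j in range(n // 2))

def Phi_bar(t):
    """Survival function of N(0, 1)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))

def pi0_ratio(n, t):
    """Ratio of the exact tail at x0 = n + sqrt(2n) t to the normal tail at t."""
    x0 = n + math.sqrt(2.0 * n) * t
    return chi2_sf_even(x0, n) / Phi_bar(t)

ratio_small = pi0_ratio(2000, 1.5)
ratio_large = pi0_ratio(20000, 1.5)
print(ratio_small, ratio_large)
```

The ratio moves toward 1 as $n$ grows, consistent with a relative error of order $n^{-1/2}$ up to logarithmic factors.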

Next, we compute $\pi_1$. Define

$\sqrt{\tilde r} = \frac{(z(1) - \sqrt{n}\tau_p^*)^2}{2\sqrt{n\log(p)}}, \qquad W = \sum_{i=2}^n z^2(i)$.

Then $\tilde r$ and $W$ are independent; furthermore, $W$ has a $\chi^2_{n-1}$ distribution.
We rewrite

$\pi_1 = E[P(\frac{W - n}{\sqrt{2n}} > (\sqrt{q} - \sqrt{\tilde r})\sqrt{2\log(p)}\,|\,z(1))]$.

For a constant $c > 0$ to be determined, let $B_1$ be the event that $|z(1)| \le \sqrt{2c\log(p)}$. Then $P(B_1^c) = L_p p^{-c}$. Over the event $B_1$, $\tilde r = r + L_p n^{-1/4}$.
When $r > q$, utilizing the results for $\pi_0$, we get

$\pi_1 = \Phi((\sqrt{r} - \sqrt{q} + o(1))\sqrt{2\log(p)})(1 + L_p n^{-1/2}) + L_p p^{-c}$

$= 1 - L_p p^{-(\sqrt{r} - \sqrt{q})^2} + L_p p^{-c}$.

When $r \le q$,

$\pi_1 = \bar\Phi((\sqrt{q} - \sqrt{r} + L_p n^{-1/4})\sqrt{2\log(p)})(1 + L_p n^{-1/2}) + L_p p^{-c}$

$= \bar\Phi((\sqrt{q} - \sqrt{r})\sqrt{2\log(p)})(1 + L_p n^{-1/4}) + L_p p^{-c}$.

We choose $c$ large enough so that $L_p p^{-c}$ is always dominated by any other
term. This gives the claim for $\pi_1$. □